
Towards biologically plausible gradient descent

by

Jordan Guerguiev

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Cell and Systems Biology

© Copyright 2021 by Jordan Guerguiev

Abstract

Towards biologically plausible gradient descent

Jordan Guerguiev
Doctor of Philosophy
Graduate Department of Cell and Systems Biology
University of Toronto
2021

Synaptic plasticity is the primary physiological mechanism underlying learning in the brain. It is dependent on pre- and post-synaptic neuronal activities, and can be mediated by neuromodulatory signals.

However, to date, computational models of learning that are based on pre- and post-synaptic activity and/or global neuromodulatory reward signals for plasticity have not been able to learn complex tasks that animals are capable of. In the machine learning field, neural network models with many layers of computations trained using gradient descent have been highly successful in learning difficult tasks with near-human level performance. It remains unclear how gradient descent could be implemented in neural circuits with many layers of synaptic connections. The overarching goal of this thesis is to develop theories for how the unique properties of neurons can be leveraged to enable gradient descent in deep circuits and allow them to learn complex tasks.

The work in this thesis is divided into three projects. The first project demonstrates that networks of cortical pyramidal neurons, which have segregated apical dendrites and exhibit bursting behavior driven by dendritic plateau potentials, can in theory leverage these physiological properties to approximate gradient descent through multiple layers of synaptic connections. The second project presents a theory for how ensembles of pyramidal neurons can multiplex sensory and learning signals using bursting and short-term plasticity, in order to approximate gradient descent and learn complex visual recognition tasks that previous biologically inspired models have struggled with. The final project focuses on the fact that machine learning models implementing gradient descent assume symmetric feedforward and feedback weights, and presents a theory for how the spiking properties of neurons can enable them to align feedforward and feedback weights in a network.

As a whole, this work aims to bridge the gap between powerful algorithms developed in the machine learning field and our current understanding of learning in the brain. To this end, we develop novel theories of how neuronal circuits in the brain can coordinate the learning of complex tasks, and present a number of experimental predictions that are fruitful avenues for future experimental research.

Acknowledgements

I would like to extend my deep appreciation to my supervisor, Blake Richards, for putting his trust in me as one of his first Ph.D. students, and for providing me with boundless knowledge, support and encouragement that propelled me through this work. In addition, I am grateful for the help of my collaborators on the work presented here – Timothy Lillicrap, Alexandre Payeur, Friedemann Zenke, Richard Naud and Konrad Kording. I would also like to thank Thomas Mesnard, for his valuable help and advice along the way.

I would also like to thank my lab mates and friends I have made throughout the years, including Matt, Annik, Kirthana, Danny, Colleen and Luke, for sharing this experience with me, and bringing me countless moments of comfort, joy and laughter.

A special thanks to my committee members, Melanie Woodin, Frances Skinner, and Douglas Tweed, for giving me an abundance of valuable advice and suggestions that have helped me improve as a scientist, and shaped this body of work into what it is today.

I would like to thank Mao for her endless love, positivity and encouragement, for which I am forever grateful. Finally, I want to thank my parents, for the many sacrifices they have made to get me to this moment, and my sister, for always supporting me and being my mentor in life.

Contents

1 Introduction
  1.1 Research contributions and thesis outline

2 Background
  2.1 Learning in the brain
  2.2 Biological neurons
    2.2.1 Inhibitory interneurons
    2.2.2 Pyramidal neurons
  2.3 Synaptic plasticity
    2.3.1 Short-term plasticity
    2.3.2 Long-term plasticity
    2.3.3 Hebbian plasticity, neuromodulation and synaptic tagging
    2.3.4 Spike timing dependent plasticity
  2.4 Machine learning
    2.4.1 Artificial neural networks
    2.4.2 Gradient descent
    2.4.3 Backpropagation of error (backprop)
    2.4.4 Convolutional neural networks
  2.5 Weight symmetry
    2.5.1 Feedback alignment
    2.5.2 Kolen-Pollack algorithm
    2.5.3 Weight mirroring
  2.6 Related models of biologically plausible gradient descent
    2.6.1 Contrastive Hebbian learning
    2.6.2 Equilibrium propagation
    2.6.3 Difference target propagation
    2.6.4 Dendritic prediction learning
    2.6.5 Dendritic error backpropagation
    2.6.6 Updated random feedback
    2.6.7 Burst ensemble multiplexing
  2.7 Project synopses
    2.7.1 Project 1: Towards deep learning with segregated dendrites

    2.7.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits
    2.7.3 Project 3: Spike-based causal inference for weight alignment

3 Project 1: Towards deep learning with segregated dendrites
  3.1 Abstract
  3.2 Author contributions
  3.3 Introduction
  3.4 Results
    3.4.1 A network architecture with segregated dendritic compartments
    3.4.2 Calculating credit assignment signals with feedback driven plateau potentials
    3.4.3 Co-ordinating optimization across layers with feedback to apical dendrites
    3.4.4 Deep learning with segregated dendrites
    3.4.5 Coordinated local learning mimics backpropagation of error
    3.4.6 Conditions on feedback weights
    3.4.7 Learning with partial apical attenuation
  3.5 Discussion
  3.6 Methods
    3.6.1 Neuronal dynamics
    3.6.2 Plateau potentials
    3.6.3 Weight updates
    3.6.4 Multiple hidden layers
    3.6.5 Learning rate optimization
    3.6.6 Training paradigm
    3.6.7 Simulation details
  3.7 Acknowledgments

4 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits
  4.1 Abstract
  4.2 Author contributions
  4.3 Introduction
  4.4 Results
    4.4.1 A burst-dependent rule enables top-down steering of plasticity
    4.4.2 Dendrite-dependent bursting combined with short-term plasticity supports multiplexing of feedforward and feedback signals
    4.4.3 Combining a burst-dependent plasticity rule with short-term plasticity and apical dendrites can solve the credit assignment problem
    4.4.4 Burst-dependent plasticity promotes linearity and alignment of feedback
    4.4.5 Ensemble-level burst-dependent plasticity in deep networks can support good performance on standard machine learning benchmarks
  4.5 Discussion
  4.6 Methods
    4.6.1 Spiking model

    4.6.2 Deep network model for categorical learning
  4.7 Acknowledgments
  4.8 Code availability

5 Project 3: Spike-based causal inference for weight alignment
  5.1 Abstract
  5.2 Author contributions
  5.3 Introduction
  5.4 Related work
  5.5 Our contributions
  5.6 Methods
    5.6.1 General approach
    5.6.2 RDD feedback training phase
    5.6.3 LIF dynamics
    5.6.4 RDD algorithm
  5.7 Results
    5.7.1 Alignment of feedback and feedforward weights
    5.7.2 Descending the symmetric alignment cost function
    5.7.3 Performance on Fashion-MNIST, SVHN, CIFAR-10 and VOC
  5.8 Discussion

6 Discussion
  6.1 Challenges and limitations
    6.1.1 Project 1: Towards deep learning with segregated dendrites
    6.1.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits
    6.1.3 Project 3: Spike-based causal inference for weight alignment
  6.2 Experimental predictions
    6.2.1 Project 1: Towards deep learning with segregated dendrites
    6.2.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits
    6.2.3 Project 3: Spike-based causal inference for weight alignment
  6.3 Future directions
    6.3.1 Learning modalities and network architectures
    6.3.2 Teaching and neuromodulatory signals
    6.3.3 Neuron types, connectivity motifs and cortical layers
  6.4 Concluding remarks

Glossary

Bibliography

A Appendix for Project 1
  A.1 Proofs
    A.1.1 Theorem for loss function coordination
    A.1.2 Hidden layer targets
    A.1.3 Lemma for firing rates
  A.2 Supplemental figures

B Appendix for Project 2
  B.1 Backprop
  B.2 Quasi-static burstprop
    B.2.1 Derivation
  B.3 Time-dependent rate model
    B.3.1 Dynamics
    B.3.2 Limiting case
  B.4 Linking the rate-based and spike-based models
    B.4.1 Linking the learning rules
  B.5 Models trained on MNIST, CIFAR-10 and ImageNet
    B.5.1 Model architectures
    B.5.2 Activation functions, burst probabilities and weight update rules
    B.5.3 Training details
    B.5.4 Hyperparameter optimization
  B.6 Supplemental figures

C Appendix for Project 3
  C.1 LIF neuron simulation details
  C.2 RDD feedback training implementation
    C.2.1 Weight scaling
    C.2.2 Feedback training paradigm
  C.3 Network and training details
  C.4 Akrout et al. algorithm implementation
  C.5 Supplemental figures

List of Tables

A.1 List of parameter values used in simulations in Project 1

B.1 Comparison of bio-inspired credit-assignment algorithms
B.2 Summary of the main equations of burstprop and backprop
B.3 Network architectures used to train on MNIST, as well as CIFAR-10 and ImageNet experiments using backprop, node perturbation and burstprop with learned feedback weights, in Figs. 4.6 and B.8
B.4 Network architectures used to train on CIFAR-10 and ImageNet with fixed feedback weights in Fig. 4.6

C.1 Network architectures used in Project 3

List of Figures

1.1 The credit assignment problem in biological circuits involved in visual processing

2.1 Feed-forward artificial neural networks

3.1 The credit assignment problem in multi-layer neural networks
3.2 Potential solutions to credit assignment using top-down feedback
3.3 Illustration of a multi-compartment neural network model for deep learning
3.4 Illustration of network phases for learning
3.5 Co-ordinated errors between the output and hidden layers
3.6 Improvement of learning with hidden layers
3.7 Approximation of backpropagation with local learning rules
3.8 Conditions on feedback synapses for effective learning
3.9 Importance of dendritic segregation for deep learning
3.10 An experiment to test the central prediction of the model

4.1 The credit assignment problem for hierarchical networks
4.2 Burst-dependent plasticity rule
4.3 Dendrite-dependent bursting combined with short-term plasticity supports the simultaneous propagation of bottom-up and top-down signals
4.4 Burst-dependent plasticity can solve the credit assignment problem for the XOR task
4.5 Burst-dependent plasticity of recurrent and feedback connections promotes gradient-based learning by linearizing and aligning feedback
4.6 Ensemble-level burst-dependent plasticity supports learning in deep networks

5.1 Illustration of weight symmetry in a neural network with feedforward and feedback connections
5.2 Outline of the feedback weight learning algorithm
5.3 Feedback weights become aligned with feedforward weights during training
5.4 Evolution of R_self during training
5.5 Results of training on Fashion-MNIST, SVHN, CIFAR-10 and VOC

A.1 Weight alignment during first epoch of training
A.2 Learning with stochastic plateau times
A.3 Importance of weight magnitudes for learning with sparse weights

B.1 Comparison of costs for the XOR task

B.2 Comparison of costs for different example durations and moving average time constants
B.3 Output-layer activity for the XOR task
B.4 Learning MNIST with the time-dependent rate model
B.5 Network mechanisms regulating the bursting nonlinearity
B.6 The bursting nonlinearity controls the learning rate
B.7 Linearity of feedback signals degrades with depth in deep convolutional network trained on ImageNet
B.8 Learning MNIST with the simplified rate model
B.9 The variance of the burst probability decreases during learning
B.10 Recurrent short-term facilitating (STF) inhibitory connections within a pyramidal neuron population help disambiguate events and bursts
B.11 Comparison between direct transmission of events and bursts and transmission mediated by short-term plasticity

C.1 Comparison of average spike rates in the fully-connected layers of the LIF network vs. activities of the same layers in the convolutional network, when both sets of layers were fed the same input

Chapter 1

Introduction

Synaptic plasticity is believed to be a key mechanism underlying learning and memory in the brain [1, 2]. In particular, plasticity in the form of long-term potentiation (LTP) and long-term depression (LTD) is implicated in the learning of behavioral tasks [3, 4, 5, 6, 7]. LTP and LTD are driven by characteristics of pre- and post-synaptic neuronal activities [8, 9, 10] and can also be affected by neuromodulators [11, 7, 12, 13, 14]. Hebbian plasticity is the long-standing theory that LTP arises when a pre-synaptic neuron repeatedly fires before the post-synaptic neuron [15]. A well-studied paradigm of Hebbian plasticity, spike-timing dependent plasticity (STDP), causes LTP or LTD at synapses depending on the timing of pre- and post-synaptic spike trains, and has been characterized for many neuron types and brain regions [16, 17, 18, 19]. Over the past few decades, Hebbian plasticity rules have been successfully used to capture a variety of phenomena observed in the brain [20, 21] and implement a number of unsupervised learning algorithms [22, 23, 24]. In order to account for the powerful learning capabilities observed in mammals, however, Hebbian learning rules must incorporate information about optimizing some objective function [7, 25, 26, 27]. A global neuromodulatory signal is in theory sufficient for optimizing any behavioral task, but learning is very slow in deep networks, because a global reward signal alone provides very limited information to neurons earlier in the hierarchy [25, 28, 27].

The fact that simple Hebbian plasticity rules and plasticity rules based on a global reward signal do not appear to be sufficient for solving complex learning tasks suggests that upstream neurons in hierarchical circuits must receive top-down neuron-specific signals that guide their direction of plasticity (i.e., whether to engage in LTP or LTD) in order to reduce their contribution to a global error. The signals that they receive should therefore reflect their causal effect on this error. This is known as the credit assignment problem – neurons in a hierarchical network should receive a signal of their credit for an error in the output, in order to undergo plasticity that reduces their contribution to this error (Figure 1.1). How the brain might solve the credit assignment problem remains an open question.

In the machine learning field, neural networks with many layers of processing units (deep neural networks) have been incredibly successful at learning difficult tasks, in some cases achieving human-level performance [29, 30, 31]. These networks are trained using gradient descent – given a cost function to minimize, weights throughout a neural network are updated in the opposite direction of the gradient of the cost function with respect to the weights. Gradient descent solves the credit assignment problem, since the gradient is by definition equal to the contribution of a weight to the cost function.


Figure 1.1: The credit assignment problem in biological circuits involved in visual processing. Visual sensory input (in this example, a wolf) is transformed through the hierarchy of the ventral stream into a representation in inferior temporal gyrus (IT) for “dog”, leading to the generation of error signals in downstream associative regions. Circles represent neurons, and filled circles indicate active neurons. Arrows between neurons represent synaptic connections. In order for useful learning to occur, upstream neurons need to know their contribution to downstream activity in order to undergo synaptic plasticity that leads to a decrease in future error signals. This is known as the credit assignment problem.

Interestingly, deep networks trained using gradient descent learn similar representations of stimuli to those observed in the neocortex [32, 33, 34, 35, 36]. This suggests that learning in the cortex involves something akin to gradient descent. However, a biological implementation of gradient descent is not straightforward, for numerous reasons [37, 38]. For one, the gradient terms that need to be calculated at a layer in the network hierarchy involve all of the downstream feedforward weights in the network. The most common gradient descent algorithm in machine learning, backpropagation of error, or backprop, uses the exact values of downstream weights to calculate the weight updates throughout the network. In a biological network, this would require neurons to receive information on all of the synaptic connections downstream, which is highly implausible given our current understanding of neuronal physiology. The problem of communicating downstream feedforward weights to upstream neurons in a network is known as the weight transport problem.

An alternative method to communicate downstream feedforward weights is to transmit error signals using a separate set of feedback connections with symmetric weights to the feedforward ones. Several models have addressed the problem of biologically plausible gradient descent using this approach [39, 40, 41]. However, this raises the non-trivial question of how symmetry between feedforward and feedback synapses can be achieved. In addition, some of these models imply a separate pathway for communicating error signals, in order to separate feedforward sensory signals from error signals used for plasticity [39, 42]. This would require each neuron in a network to have a one-to-one pairing with a feedback neuron. While possible, there is no evidence for this type of one-to-one pairing occurring in biological circuits. Finally, the communication of signed error signals using binary spiking neurons is not straightforward, and any theory for biologically plausible gradient descent should account for this.

The overall aim of the work presented here is to develop computational theories that explore how the brain might solve these issues, and to present falsifiable predictions about neural physiology and circuitry that are fruitful avenues for future experimental research.

1.1 Research contributions and thesis outline

This thesis explores how gradient descent can be approximated in a biologically plausible manner in deep hierarchical networks, while addressing (1) the issue of symmetric feedback connections, (2) the need for a separate pathway to communicate error signals, and (3) how signed error signals can be communicated. The work presented in this thesis contributes to the growing body of research on biologically plausible gradient descent by developing novel models for how the brain can coordinate plasticity in deep hierarchical networks in a way that approximates gradient descent and enables learning of complex tasks. In addition, this work presents experimental predictions that can be investigated in future work.

The thesis is divided into three projects, each of which corresponds to a research paper that is either published or accepted for publication at a journal or conference:

• Project 1: Towards deep learning with segregated dendrites (published in eLife [43])

• Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits (accepted for publication in Nature Neuroscience, available as a preprint [44])

• Project 3: Spike-based causal inference for weight alignment (published as a conference paper at ICLR [45])

The next chapter provides a background overview of the relevant neuroanatomical and machine learning concepts that inform this work, as well as other related models of biologically plausible gradient descent. In addition, a brief synopsis of each project is provided.

Chapter 3 describes Project 1, which presents a model that leverages properties of pyramidal cortical neurons, namely segregated apical dendrites and dendritic plateau potentials, to approximate gradient descent in a biologically plausible manner. This model uses reciprocal feedback connections between neurons with fixed, random weights, taking advantage of the feedback alignment effect [46] to overcome the weight transport problem. Segregated apical dendrites enable feedforward and feedback signals to be communicated without requiring a separate feedback pathway, and signed error signals are communicated using the difference in dendritic plateau potentials.

Chapter 4 describes Project 2, which demonstrates that ensembles of pyramidal neurons can encode both sensory feedforward information and feedback error signals concurrently using bursting, and that this multiplexing can enable networks of ensembles to approximate gradient descent and solve difficult visual recognition tasks. A learning rule for feedback weights enables them to learn to become symmetric with feedforward weights, and signed error signals are communicated using the temporal difference in ensemble burst rates.

Note that the computational work described in Chapter 4 is broadly divided into two parts: a spiking simulation model (presented in Sections 4.4.1-4.4.4 and 4.6.1), which was developed by Alexandre Payeur, and a rate-based model (Sections 4.4.4, 4.4.5 and 4.6.2), which was done by Jordan Guerguiev and is the focus of this thesis.

Chapter 5 describes Project 3, which focuses on a learning rule for feedback weights that allows them to become symmetric with feedforward weights by taking advantage of the spiking properties of neurons. This is accomplished using a technique called regression discontinuity design (RDD), which allows a neuron to estimate its causal effect on the activities of downstream neurons by taking advantage of the binary spiking threshold.

Lastly, Chapter 6 discusses the overall conclusions, limitations and experimental predictions of the work presented in this thesis, as well as future research directions.

Chapter 2

Background

2.1 Learning in the brain

The ability of an animal to learn, i.e., to acquire knowledge about the external world and use this knowledge to adapt its behavior, is enabled by complex structures in the brain. In the neuroscience field, extensive attention has been given to studying the neural correlates of learning. The neocortex plays a critical role in enabling the powerful cognitive abilities of mammals, such as visual and auditory perception, motor control and language. Through experience with the external world, changes in neural activities in cortical regions enable mammals to acquire new cognitive abilities, such as successfully navigating a maze or learning to play the piano. What specific changes in neural activities enable the learning of new skills and how these changes occur is still an area of active experimental and theoretical research. What follows is a brief outline of the biological and theoretical knowledge that informs the work presented in this thesis, with a focus on cortical neurons.

2.2 Biological neurons

Neurons in the brain can be broadly classified into two types: excitatory and inhibitory. Excitatory neurons are those that have a depolarizing effect on post-synaptic neurons through the release of excitatory neurotransmitters such as glutamate as a result of an action potential, while inhibitory neurons cause post-synaptic hyperpolarization by releasing inhibitory neurotransmitters like gamma-Aminobutyric acid (GABA). Within each of these two broad classifications, a large variety of neurons can be found, in terms of morphology, physiology, connectivity and function.

2.2.1 Inhibitory interneurons

Inhibitory neurons comprise about 10-20% of the cortical neuronal population [47]. While they comprise a relatively small subset of neocortical neurons, their axons project diffusely onto large numbers of neurons [48, 49, 50], and they play an important role in sensory processing, motor control and [51]. In addition, inhibitory neurons are essential for maintaining balanced cortical activity [14].

Almost all neocortical inhibitory neurons are interneurons (neurons whose axons are limited to within a single brain area) [50]. These interneurons can be placed in three broad classes based on protein expression: parvalbumin-expressing (PV+) interneurons, somatostatin-expressing (SST+) interneurons, and serotonin receptor 5HT3a-expressing (5HT3aR+) interneurons [50]. 5HT3aR+ interneurons are further broken down into those that express vasoactive intestinal peptide (VIP), and those that do not.

Recent studies of these interneuron classes have found a set of consistent connectivity motifs throughout major neocortical areas [52, 50]. PV+ interneurons are mainly comprised of fast-spiking basket cells and chandelier cells, and mainly project into perisomatic and axonal regions of pyramidal neurons. SST+ interneurons are mostly comprised of Martinotti cells, which primarily project to apical dendrites of pyramidal neurons. Finally, VIP+ cells are bipolar or multipolar cells that provide disinhibition of excitatory cells by inhibiting PV+ and SST+ interneurons. Cortical inhibitory interneurons receive local inputs from pyramidal neurons and other interneurons, long-range thalamic inputs, as well as inputs from other cortical regions [50, 52].

2.2.2 Pyramidal neurons

Pyramidal neurons are the most abundant type of excitatory neuron, found in the brains of all mammals, birds, fish and reptiles [53]. They are characterized by having two distinct dendritic domains, the basal and apical dendrites, as well as the pyramidal shape of the soma. In the neocortex, they are found in all layers except layer 1, in which apical dendritic tufts of pyramidal neurons in other layers are found. There are also differences in the inputs and outputs of pyramidal neurons found in different cortical layers. Layer 4 pyramidal neurons receive feedforward sensory input from thalamic projections, and project locally to layer 2/3, and to some extent layer 5 [53]. Layer 2/3 pyramidal neurons project locally to layer 5 neurons as well as to other cortical regions. Pyramidal neurons in layer 5 typically have long-range connections to other cortical regions, as well as to subcortical areas [54, 53]. Finally, layer 6 neurons receive direct thalamic input and provide long-range feedback to primary thalamic nuclei and intracortical projections to layer 4 [55, 56, 57]. Axons carrying feedback signals from higher order cortical areas project principally to layer 1 apical dendritic tufts, but also to layer 5 and 6 [58, 59].

Interestingly, pyramidal neurons have several unique properties that lead to complex behaviors, outlined below. As presented in this work, these properties can theoretically enable networks of pyramidal neurons to perform gradient descent.

Segregated dendritic compartments, action potentials and bursting

Pyramidal neurons in layer 5 of cortex, as well as in layer 2/3 and in hippocampus, have two distinct sets of dendrites: shorter basal dendrites near the soma, and longer apical dendrites that are electrotonically distant from the soma [60, 61, 62]. Several studies support the notion that basal dendrites of layer 5 pyramidal neurons typically receive bottom-up sensory input, while apical dendrites generally receive top-down feedback signals from higher-order processing areas [63, 64, 65, 53]. Patch-clamp studies have identified that apical dendrites of pyramidal neurons in layer 5 and in CA1 of hippocampus can generate supralinear dendritic Ca2+ spikes (also known as calcium plateau potentials) at an initiation zone near the main apical bifurcation [66, 61, 60]. These spikes can cause sufficient depolarization at the axonal action potential (AP) initiation zone to generate an action potential. Ca2+ spikes can be generated by coincident apical dendritic input and back-propagating APs [60, 61]. In addition, input at thin apical tuft dendrites can trigger N-methyl-D-aspartate (NMDA) spikes, which can in turn generate Ca2+ spikes and action potentials [61].

Notably, a back-propagating AP can initiate an additional Ca2+ spike, given sufficient apical input [61, 63]. The interaction between Ca2+ spikes and back-propagating APs leads to a series of high-frequency APs being generated at the soma, known as bursts. Ca2+ spikes and AP bursts have been implicated in the induction of long-term synaptic plasticity in pyramidal neurons [67, 68, 69].

2.3 Synaptic plasticity

Synaptic plasticity refers to changes in synaptic strength between neurons in the brain. Synaptic strength refers to the relative size of post-synaptic depolarizing or hyperpolarizing current that is generated by a pre-synaptic spike, and is dependent on many factors, including the volume of neurotransmitter released by the pre-synaptic cell and the types, states and amounts of post-synaptic receptors at the synapse. Synaptic plasticity has been observed in virtually every type of synapse in both invertebrates and mammals and is believed to be a key physiological substrate for learning and memory. There are two main types of synaptic plasticity: short-term plasticity, which refers to temporary presynaptic activity-dependent changes in synaptic strength that last on the order of tens of milliseconds to a few minutes, and long-term plasticity, which operates on much longer timescales (hours or more).

2.3.1 Short-term plasticity

Short-term plasticity (STP) includes short-term facilitation (STF) and short-term depression (STD). Because of their transient nature, STF and STD are not believed to be directly involved in learning and memory, but rather to act as adaptive mechanisms. STP usually occurs as a result of pre-synaptic mechanisms [70]. STD is most commonly characterized by a depletion of the readily-releasable pool (RRP) of synaptic vesicles and subsequent drop in neurotransmitter release, but can also arise from saturation or desensitization of post-synaptic receptors [70, 71]. The mechanisms underlying STF are less well-characterized, although various theories have been proposed, including increased neurotransmitter release due to residual Ca2+ in the pre-synaptic bouton, or the binding of Ca2+ to calcium buffers such as calbindin during the initial action potential, leading to an overall increase of Ca2+ reaching the neurotransmitter release sites with subsequent APs [70].

2.3.2 Long-term plasticity

Long-term plasticity, in the form of long-term potentiation (LTP) and long-term depression (LTD), involves persistent changes in the synaptic strength at a chemical synapse. This type of plasticity is considered to be the main mechanism for information storage and learning in the brain [1], and occurs at both excitatory and inhibitory synapses [72].

Most research of LTP has focused on the hippocampus, due to its importance for memory formation, but an additional body of work has taken a closer look at LTP in other cortical areas such as motor, auditory and visual cortex [69, 73]. LTP is mostly, if not exclusively, a post-synaptic phenomenon. At excitatory synapses, a widely studied form of LTP is mediated by the activation of NMDA ionotropic glutamate receptors (NMDARs) [9, 69, 74]. Influx of Ca2+ through NMDARs activates CaMKII proteins, which trigger phosphorylation of AMPARs and upregulate the trafficking of AMPARs to the membrane [9]. Likewise, LTD triggered by activation of NMDA receptors leads to dephosphorylation and removal of AMPA receptors from the synapse. Both NMDAR-mediated LTP and LTD are dependent on activation of both the pre- and post-synaptic cell, as NMDARs are activated only when glutamate is bound and there is sufficient depolarization to expel their magnesium block. Various forms of LTP and LTD have also been identified at inhibitory synapses [75, 76].

2.3.3 Hebbian plasticity, neuromodulation and synaptic tagging

A long-standing postulate in theoretical neuroscience, known as Hebb’s postulate, states that long-term synaptic plasticity is dependent on the co-occurrence of pre- and post-synaptic activity (“cells that fire together, wire together”). Experimental protocols for inducing LTP and LTD have demonstrated that LTP and LTD can be induced by activation of the pre-synaptic neuron and either activation or depolarization of the post-synaptic neuron [3]. Thus, a general Hebbian plasticity rule has the form:

$$\frac{dw_{ij}}{dt} = s_i(t)\, a_j(t) \qquad (2.1)$$

where $w_{ij}$ is the strength of the synapse between pre-synaptic neuron j and post-synaptic neuron i, $s_i$ is the state of the post-synaptic neuron (its membrane potential, firing rate, spike times, etc.), and $a_j$ is the spiking activity of the pre-synaptic neuron. A wealth of experimental evidence supports Hebb's postulate – for example, both NMDAR-mediated LTP and LTD are dependent on coincident pre- and post-synaptic activity. Hebbian plasticity rules have been demonstrated to generate features seen in the cortex, such as ocular dominance and orientation columns seen in visual cortex [20], and are capable of learning unsupervised learning tasks. However, the simple two-factor Hebbian learning rule (where the two factors are the presynaptic and postsynaptic activities) is not sufficient for learning a behavioral objective, because it does not contain any term that reflects the direction of weight change that would reduce a learning-related error [27, 77]. To account for this, weight update rules incorporating a third factor, reflecting some form of neuromodulatory signal that gates plasticity, have been proposed. These plasticity rules have the form:

$$\frac{dw_{ij}}{dt} = s_i(t)\, a_j(t)\, M(t) \qquad (2.2)$$

where M(t) is a neuromodulatory signal, representing reward, the difference between reward and expected reward, or the saliency of an event. Experimentally, relationships between neuromodulatory activity and learning of behaviors have been identified, such as the importance of dopamine (DA) neurons and DA receptor activation for the learning of reward-motivated behavior [78, 79].
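To make the distinction concrete, here is a minimal simulation of the two-factor rule (Eq. 2.1) and the three-factor rule (Eq. 2.2) side by side, assuming Poisson stand-ins for the pre- and post-synaptic activities and a sparse pulse for M(t); all constants are illustrative choices, not values from any experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 1e-3      # integration time step (s); arbitrary for illustration
eta = 0.01     # learning rate

w_hebb = 0.5   # synapse under the two-factor rule (Eq. 2.1)
w_mod = 0.5    # synapse under the three-factor rule (Eq. 2.2)

for t in range(1000):
    a_j = float(rng.poisson(5.0))        # pre-synaptic activity (stand-in rate)
    s_i = float(rng.poisson(5.0))        # post-synaptic activity
    M = 1.0 if t % 200 == 0 else 0.0     # sparse global neuromodulatory signal

    w_hebb += eta * dt * s_i * a_j       # dw/dt = s_i(t) a_j(t)
    w_mod += eta * dt * s_i * a_j * M    # dw/dt = s_i(t) a_j(t) M(t)

# The unmodulated synapse drifts with every coincidence; the modulated one
# changes only on the rare time steps where M(t) is active.
print(w_hebb, w_mod)
```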

Another issue with two-factor correlative Hebbian plasticity rules is that the timescale of behavior and reward is much longer than that of purely Hebbian synaptic plasticity mechanisms [7]. The theory of synaptic tagging and capture (STC) addresses this issue, and recent experiments have found evidence supporting this theory in the hippocampus, cortex and striatum [73, 80, 81]. The STC theory posits that Hebbian mechanisms create a synaptic tag at synapses that decays at a behavioural timescale (seconds). However, the induction of this tag does not trigger plasticity in the form of LTP or LTD unless a global neuromodulatory signal is present, acting as a third factor. This process can be formalized using the following equations:

$$\frac{de_{ij}}{dt} = s_i(t)\, a_j(t) - \frac{e_{ij}}{\tau_e} \qquad (2.3)$$

$$\frac{dw_{ij}}{dt} = e_{ij}(t)\, M(t) \qquad (2.4)$$

where $e_{ij}$ is the synaptic tag at the synapse between presynaptic neuron j and postsynaptic neuron i that decays with time constant $\tau_e$, and M(t) is a modulatory signal that could be determined by neuromodulators such as dopamine, serotonin, acetylcholine or noradrenaline. Note that M(t) is not neuron-specific, since most neuromodulator-releasing neurons project diffusely onto a large number of neurons [7]. STC provides a compelling theory for how reward-based learning may be implemented in the brain. Synaptic tags are analogous to eligibility traces in reinforcement learning models in machine learning. These are powerful models capable of reward-based learning in a number of difficult tasks, in some cases surpassing human performance [82, 31].
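A minimal simulation of Eqs. 2.3-2.4, assuming a burst of coincident pre/post activity followed, seconds later, by a brief neuromodulatory pulse standing in for M(t); constants are illustrative. The point is that the synapse still potentiates even though activity and reward never overlap in time.

```python
import numpy as np

dt, tau_e, eta = 1e-3, 2.0, 0.05   # time step (s), tag decay constant (s), learning rate
e, w = 0.0, 0.5                    # synaptic tag and synaptic weight

for step in range(5000):
    t = step * dt
    a_j = 10.0 if t < 1.0 else 0.0     # pre-synaptic activity during the first second
    s_i = 10.0 if t < 1.0 else 0.0     # coincident post-synaptic activity
    M = 1.0 if 3.0 < t < 3.1 else 0.0  # neuromodulatory pulse arriving two seconds later

    e += dt * (s_i * a_j - e / tau_e)  # Eq. 2.3: tag rises with co-activity, decays with tau_e
    w += dt * eta * e * M              # Eq. 2.4: plasticity gated by the late modulatory signal

print(w)  # w > 0.5: the decaying tag bridged the delay to the reward
```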

2.3.4 Spike timing dependent plasticity

The induction of LTP or LTD at a synapse, depending on the timing of pre- and post-synaptic spikes, is a well-studied form of Hebbian plasticity called spike-timing dependent plasticity (STDP). STDP has been observed at both excitatory and inhibitory synapses in many different neuron types throughout the brain [83, 16, 84, 17]. In a typical STDP protocol, pre-synaptic spikes are repeatedly paired with post-synaptic spikes with a fixed pre/post spiking interval. The spike interval can be positive, when the pre-synaptic neuron spikes before the post-synaptic neuron, or negative if the post-synaptic neuron spikes first. The change in synaptic strength following the induction protocol is then measured, and the protocol is repeated with a different pre/post interval. The window of pre/post timing that causes LTP or LTD can be very different depending on the brain region, and on whether the pre- and post-synaptic neurons are excitatory or inhibitory [16].
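The outcome of such pairing protocols is often summarized with a phenomenological double-exponential window; the sketch below is that standard textbook form, not a model from this thesis, and its amplitudes and time constants are illustrative.

```python
import numpy as np

def stdp_window(delta_t, A_plus=0.01, A_minus=0.012, tau_plus=0.020, tau_minus=0.020):
    """Weight change for a pre/post pair separated by delta_t = t_post - t_pre (s).

    Positive intervals (pre before post) give LTP, negative intervals give LTD,
    with exponentially decaying magnitude; a common phenomenological form.
    """
    delta_t = np.asarray(delta_t, dtype=float)
    return np.where(delta_t > 0,
                    A_plus * np.exp(-delta_t / tau_plus),
                    -A_minus * np.exp(delta_t / tau_minus))

intervals = np.array([-0.040, -0.010, 0.010, 0.040])  # pre/post intervals (s)
print(stdp_window(intervals))  # LTD for post-before-pre pairs, LTP for pre-before-post
```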

Distinct mechanisms for STDP induction have been characterized at excitatory and inhibitory synapses. At excitatory synapses, STDP depends on both NMDAR activation and a rise in post-synaptic Ca2+ level [85, 16], and has also been found to depend on pre-synaptic factors [86]. At GABAergic synapses, on the other hand, STDP appears to depend only on post-synaptic factors, and has been linked to influx of Ca2+ through L- and T-type voltage-gated Ca2+ channels and subsequent changes in the K+-Cl- co-transporter KCC2 state, leading to a change in GABA reversal potential [87, 84, 88].

The traditional STDP protocol is a form of purely Hebbian plasticity, but recent work suggests that neuromodulatory signals can either gate STDP plasticity induction entirely or modify the STDP window [11, 7, 12, 13, 14], consistent with three-factor plasticity rules.

2.4 Machine learning

Machine learning is a branch of statistics and computer science that studies computational algorithms for learning patterns in data and making accurate predictions or decisions, without being explicitly programmed to perform the task. In recent years, machine learning models have been extremely successful at solving complex tasks related to visual processing [89, 90, 91], speech recognition [92], natural language processing [93, 94, 95], and control [31, 96, 97]. These models leverage large amounts of training data related to the task at hand and fast processing power enabled by graphics processing units (GPUs). Machine learning models have successfully learned to perform image recognition, speech recognition, object detection, image generation, sentiment analysis, playing video and board games, driving autonomous vehicles, and many other tasks. Machine learning algorithms are broadly classified into three families: supervised learning (where a model is trained to perform a classification or regression task using an explicit teaching signal), unsupervised learning (where a model does not receive any teaching signal and is tasked with reducing the dimensionality of data by learning meaningful relationships in its features), and reinforcement learning (where a model receives a global reward or punishment signal and is tasked with learning a strategy for maximizing the received reward). The scope of this thesis includes only supervised learning algorithms, although the learning rules presented should be extendable to reinforcement learning and unsupervised learning domains.

2.4.1 Artificial neural networks

Artificial neural networks (ANNs) are a type of machine learning model that consists of a hierarchy of layers of computational units (Figure 2.1A). Each unit integrates its weighted inputs and applies an activation function in order to produce a scalar output. Neural networks can be feedforward networks, where units in layer l provide input only to units in downstream layer l + 1, or recurrent networks, in which units in the same layer can provide input to each other or a unit can provide input to itself. Typically, feedforward networks are used to learn classification or regression tasks involving static inputs (such as images), while recurrent networks are well-suited to learning functions of sequential inputs (such as text documents or videos).

Non-linear activation functions are typically used in ANNs, such as the logistic (or sigmoid), hyperbolic tangent, rectified linear unit (ReLU), or softmax functions, which allow them to approximate complex non-linear functions. ANNs with these activation functions are universal approximators, meaning they can approximate almost any continuous function [98]. This property, combined with powerful learning algorithms like backpropagation of error, has enabled their success at a wide variety of difficult learning tasks.
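As a minimal illustration of these computations, the sketch below implements the forward pass of a fully-connected feedforward network like the one in Figure 2.1A, assuming ReLU hidden units and a linear output layer; the layer sizes and input are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Layer sizes (input -> two hidden layers -> output); arbitrary for illustration
sizes = [4, 8, 8, 2]
weights = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    """Propagate an input through the network, applying ReLU at hidden layers."""
    for l, (W, b) in enumerate(zip(weights, biases)):
        x = W @ x + b
        if l < len(weights) - 1:   # nonlinearity on hidden layers only
            x = relu(x)
    return x

print(forward(rng.normal(size=4)))  # activity of the two-unit output layer
```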

2.4.2 Gradient descent

Training of ANNs involves performing gradient descent on a cost function, L. In a supervised setting, this cost function is a measure of the error between the output of the network and the desired output, or target. For example, given an input x, the mean squared error (MSE) cost function measures the squared difference between the output and target:

$$L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \| \hat{y}_i - y_i \|^2 \qquad (2.5)$$

where n is the number of input samples, and $y_i$ and $\hat{y}_i$ are the output and target output for input sample i, respectively. The goal of training is to change the parameters in the network in order to minimize this cost function.

Figure 2.1: A. A typical fully-connected feedforward artificial neural network, containing an input layer, a series of hidden layers, and an output layer. Circles denote units in each layer. Filled circles represent active units. Arrows represent feedforward connections between units. The output layer activity is used to compute the cost function L, which is the objective function that is minimized using gradient descent. B. A convolutional neural network utilizes convolutional layers. Each unit in these layers receives input only from a subset of the units in the previous layer (in this example, units in the first convolutional layer receive input from a subset of pixels in the input image). Dashed lines indicate units’ receptive fields.

The cost function cannot be minimized analytically when non-linear activation functions are used; therefore, the parameters in the network are updated in a stochastic manner, such that the value of the cost function decreases after every parameter change. In ANNs, the parameters that are updated are the weights between units. Weights are updated using gradient descent, which updates every weight w in the direction opposite to the gradient of the cost function with respect to that weight:

$$\Delta w \propto -\frac{\partial L}{\partial w} \qquad (2.6)$$

Gradient descent is guaranteed to update weights in a way that reduces the cost function L.
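Eqs. 2.5 and 2.6 can be made concrete with a toy example: gradient descent on the MSE cost of a single linear unit, with a synthetic dataset generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: targets produced by a hidden linear rule plus noise
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
targets = X @ w_true + 0.1 * rng.normal(size=100)

w = np.zeros(3)   # weights to be learned
eta = 0.1         # learning rate

for epoch in range(200):
    outputs = X @ w                                  # network outputs y_i
    grad = 2.0 / len(X) * X.T @ (outputs - targets)  # gradient of the MSE cost (Eq. 2.5)
    w -= eta * grad                                  # Eq. 2.6: step against the gradient

print(w)  # approaches w_true as the cost is minimized
```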

2.4.3 Backpropagation of error (backprop)

The most common algorithm used to train neural networks using gradient descent is backpropagation of error, or backprop. Backprop takes advantage of the chain rule property of derivatives to optimize the computation of weight updates by avoiding redundant calculations. Weight updates are computed starting from the output layer of the network and iterating backward through the network. The previously computed derivatives with respect to the units at layer l + 1 are used to compute the derivatives for layer l:

$$\delta_i^l \equiv \frac{\partial L}{\partial y_i^l} = \sum_j \delta_j^{l+1} \left( \frac{\partial y_j^{l+1}}{\partial y_i^l} \right) \qquad (2.7)$$

where $y_i^l$ and $\delta_i^l$ are the output and the error derivative of unit i in layer l, respectively. In a fully-connected feedforward neural network with linear activation functions, this is equivalent to:

$$\delta_i^l = \sum_j w_{ij}^l\, \delta_j^{l+1} \qquad (2.8)$$

where $w_{ij}^l$ is the feedforward weight between unit i in layer l and unit j in layer l + 1.
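The recursion in Eqs. 2.7-2.8 can be checked numerically. The sketch below propagates error derivatives backward through a small linear network and compares one entry of the hidden-layer error against a finite-difference estimate of the corresponding partial derivative; sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear network; W[l] maps the activity of layer l to layer l+1
W = [rng.normal(0, 0.5, (5, 4)), rng.normal(0, 0.5, (3, 5))]
x = rng.normal(size=4)
target = rng.normal(size=3)

# Forward pass with linear activations
y1 = W[0] @ x
y2 = W[1] @ y1

# Output-layer error derivative for L = 0.5 * ||y2 - target||^2
delta2 = y2 - target
# Eq. 2.8: errors propagate backward through the feedforward weights
delta1 = W[1].T @ delta2

# Finite-difference check of dL/dy1 for the first hidden unit
eps = 1e-6
y1_pert = y1.copy()
y1_pert[0] += eps
L0 = 0.5 * np.sum((W[1] @ y1 - target) ** 2)
L1 = 0.5 * np.sum((W[1] @ y1_pert - target) ** 2)
print(delta1[0], (L1 - L0) / eps)  # the two estimates agree
```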

2.4.4 Convolutional neural networks

Rather than using fully-connected layers throughout the network, convolutional neural networks (CNNs) introduce convolutional layers (Figure 2.1B). These layers perform convolution over activities in the previous layer, rather than weighted integration. Each unit in a convolutional layer receives input only from a subset of the units in the previous layer. The development of convolutional layers was inspired by neurons in the visual processing regions of the brain, which exhibit retinotopically organized receptive fields. The receptive field of a neuron in the visual system is the subset of the visual space that, when stimulated, causes the neuron to respond. Receptive field size grows with the depth into the visual system hierarchy, such that neurons in V4 have larger receptive fields than V1 cells [99], a phenomenon replicated in convolutional networks. The introduction of CNNs was a breakthrough in visual processing that enabled much more efficient learning of visual recognition tasks.

Unlike real neurons, convolutional layers in ANNs use shared weights, that is, all of the units in a convolutional layer integrate their inputs using the same set of weights. This dramatically reduces the number of parameters in the model, making learning much more efficient. Shared weights are clearly biologically implausible, but recent work demonstrates that CNNs trained without shared weights, known as locally-connected CNNs, can match the performance of traditional CNNs that use shared weights [100, 101].

2.5 Weight symmetry

Gradient descent requires the use of information about all downstream weights in the calculation of weight updates. This arises from the definition of the partial derivative (see equations 2.7 and 2.8). In neuroscience, this is known as the weight transport problem, because it is unclear how this information could be communicated given our current understanding of the brain. The weight transport problem can be solved if top-down feedback signals are communicated through feedback connections that have symmetric weights to the feedforward connections. However, the biological plausibility of symmetric feedforward and feedback connections, and how this symmetry can be achieved, remain unclear. Various algorithms have been proposed that deal with the issue of weight transport, some of which are outlined below.

2.5.1 Feedback alignment

Lillicrap et al. [46] identified that symmetric weights are not strictly necessary for gradient descent. Instead, a neural network that uses fixed, random feedback weights is able to solve machine learning tasks like image classification. In a network with fixed, random feedback weights, the feedback alignment (FA) effect causes feedforward weights to naturally become more aligned with feedback weights during the course of learning the task, in furtherance of minimizing the global cost function. While feedback alignment is able to match the performance of backprop on tasks like MNIST handwritten digit classification, further research has shown that it cannot successfully learn most difficult tasks requiring deeper convolutional networks, such as CIFAR-10 and ImageNet image classification [100, 102]. This performance gap is partially overcome if feedback weights are not necessarily symmetric, but match the sign of feedforward weights [103, 104, 102]. It is now clear that, while strict weight symmetry is not required for gradient descent, the degree of weight symmetry correlates with learning performance, and that strictly symmetric weights are the optimal regime [102, 100, 103, 105].
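A minimal sketch of the feedback alignment effect in a two-layer linear network: the backward pass uses a fixed random matrix B in place of the transposed feedforward weights, and the loss still falls. The teacher, sizes, and learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random linear teacher providing inputs and targets
X = rng.normal(size=(200, 10))
targets = X @ rng.normal(size=(10, 2))

W1 = rng.normal(0, 0.1, (10, 20))   # input -> hidden weights
W2 = rng.normal(0, 0.1, (20, 2))    # hidden -> output weights
B = rng.normal(0, 0.1, (2, 20))     # fixed random feedback weights (replaces W2.T)
eta = 0.05

for epoch in range(500):
    H = X @ W1                      # hidden activity
    E = H @ W2 - targets            # output error
    delta_H = E @ B                 # feedback alignment: random B instead of W2.T
    W2 -= eta * H.T @ E / len(X)
    W1 -= eta * X.T @ delta_H / len(X)

print(np.mean((X @ W1 @ W2 - targets) ** 2))  # loss decreases despite random feedback
```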

A related model to feedback alignment, called direct feedback alignment (DFA), uses a single set of error units that project back to all neurons in the network, rather than layer-by-layer sets of error units. This dramatically reduces the number of error units required, especially in deep networks. However, like FA, DFA does not match the performance of backprop on difficult learning tasks [100, 102]. Below, two mechanisms for achieving weight symmetry by updating feedback weights are outlined. These operate in network architectures with either reciprocal layer-by-layer feedback connections, or layer-specific error units, as in FA. Future work could develop learning rules for feedback connections in DFA-like network architectures that enable weight symmetry.

2.5.2 Kolen-Pollack algorithm

Kolen and Pollack [106] identified that if feedforward and feedback weights are updated using the same weight update term as well as a weight decay term, then they are guaranteed to converge. Suppose that weight updates are of the form:

$$\Delta W = A - \lambda W \qquad (2.9)$$

$$\Delta Y = A - \lambda Y \qquad (2.10)$$

where W is the matrix of feedforward weights between two layers in the network, Y is the matrix of reciprocal feedback weights, A is any matrix with the same shape, and 0 < λ < 1. These weight updates will cause W and Y to converge over time, leading to weight symmetry in a network. Networks trained using this algorithm are able to learn complex tasks where feedback alignment fails [42]. The work presented in Chapter 4 uses this form of weight updates in order to achieve symmetric weights.
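A numerical check of Eqs. 2.9-2.10: whatever the shared update term A is at each step (random here, gradient-based in practice), the common weight decay contracts the difference between W and Y by a factor of (1 − λ) per update, so the two matrices converge.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(5, 5))   # feedforward weights
Y = rng.normal(size=(5, 5))   # feedback weights, initialized independently
lam = 0.1                     # weight decay, 0 < lambda < 1

for step in range(200):
    A = rng.normal(size=(5, 5))   # shared update term, arbitrary at every step
    W += A - lam * W              # Eq. 2.9
    Y += A - lam * Y              # Eq. 2.10

# (W - Y) is multiplied by (1 - lam) each step, so the gap shrinks geometrically
print(np.max(np.abs(W - Y)))
```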

2.5.3 Weight mirroring

Akrout et al. [42] present a mechanism for learning weight symmetry called weight mirroring. This mechanism is presented in a network with a separate error signaling pathway, with one error unit for each unit in the feedforward layers (however, this algorithm should in theory be compatible with networks using multi-compartment units instead of separate error-carrying units, as in chapters 3 and 4). The network alternates between two modes. During the first mode, called the engaged mode, input propagates forward through the feedforward layers to compute their activities, while the error signal at the output layer propagates backward through the error pathway:

$$y^{l+1} = \sigma(W^{l+1} y^l) \qquad (2.12)$$

$$\delta^l = \sigma(Y^{l+1} \delta^{l+1}) \qquad (2.13)$$

where $y^l$ is the output of feedforward layer l, $W^{l+1}$ are the feedforward weights between layers l and l + 1, $\delta^l$ is the error signal at layer l, $Y^{l+1}$ are the feedback weights between layers l + 1 and l, and σ is a nonlinear activation function. In the mirror mode, the output of a single layer l of the network is given by noise, and the error units at layers l and l + 1 are clamped such that $\delta^l = y^l$ and $\delta^{l+1} = y^{l+1}$. Then, feedback weights $Y^{l+1}$ are updated using a Hebbian learning rule:

$$\Delta Y^{l+1} = \eta\, \delta^l (\delta^{l+1})^T \qquad (2.14)$$

where η is a learning rate. The authors demonstrate that this learning rule allows a network to learn tasks where feedback alignment fails.
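A minimal sketch of mirror mode for a single layer, assuming unit-variance Gaussian noise as the layer's activity, a linear activation, and a small weight decay added here for stability; because E[y y^T] = I for such noise, the Hebbian update of Eq. 2.14 pushes Y toward a scaled transpose of W.

```python
import numpy as np

rng = np.random.default_rng(0)

n_l, n_lp1 = 20, 10
W = rng.normal(0, 0.3, (n_lp1, n_l))   # feedforward weights, layer l -> l+1
Y = np.zeros((n_l, n_lp1))             # feedback weights to be mirrored
eta, lam = 0.01, 0.01                  # learning rate and stabilizing decay

for step in range(5000):
    y_l = rng.normal(size=n_l)         # mirror mode: layer l is driven by noise
    y_lp1 = W @ y_l                    # Eq. 2.12 with a linear activation
    delta_l, delta_lp1 = y_l, y_lp1    # error units clamped to the activities
    Y += eta * np.outer(delta_l, delta_lp1) - lam * Y   # Eq. 2.14 plus decay

corr = np.corrcoef(Y.ravel(), W.T.ravel())[0, 1]
print(corr)  # close to 1: the feedback weights mirror the feedforward weights
```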

2.6 Related models of biologically plausible gradient descent

In recent years, significant attention has been placed on developing algorithms for biologically plausible gradient descent, in part due to the severe limitations of Hebbian plasticity rules and the immense success of gradient descent-based models in the machine learning field. Below, a selection of these algorithms, some of which directly informed the work presented in this thesis, are briefly outlined.

2.6.1 Contrastive Hebbian learning

Contrastive Hebbian learning (CHL) is a biologically-inspired alternative to backprop that is based on Hebb's rule. Unlike backprop, the activities of units exhibit temporal dynamics. In CHL, each feedforward connection is matched with a reciprocal feedback connection between units, and units integrate both the feedforward and feedback signals they receive. In standard CHL, feedback weights are assumed to be symmetric to feedforward weights, although recent work has shown that CHL can work with fixed, random feedback weights on some tasks [107]. There are two phases of learning: a free phase, where the input to the network is clamped and propagates through the network until the network reaches a fixed point, and a clamped phase, where the input is clamped and the output units are clamped to the target, until the network reaches another fixed point. Weight updates are based on the difference in the products of pre- and post-synaptic steady-state activities of units in the two phases:

$$\Delta W^l \propto \hat{y}^l \otimes \hat{y}^{l-1} - \check{y}^l \otimes \check{y}^{l-1} \qquad (2.15)$$

where $\hat{y}^l$ and $\check{y}^l$ are the steady-state outputs at layer l in the clamped phase and free phase, respectively. CHL performs gradient descent [108], but the assumption of weight symmetry is biologically problematic, although random feedback weights have been shown to work with the model [107], and weight update rules on feedback weights that achieve symmetry are in theory compatible with the model. However, to date, CHL has not been shown to scale to deep networks.
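A compact sketch of the two-phase procedure for one hidden layer, assuming symmetric feedback (the transpose of the feedforward weights), sigmoid units, a feedback strength γ, and naive fixed-point iteration for the settling dynamics; it applies the contrastive update of Eq. 2.15 once. This is an illustration of the rule, not a replication of any published setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_h, n_out = 4, 6, 2
W1 = rng.normal(0, 0.5, (n_h, n_in))    # input -> hidden weights
W2 = rng.normal(0, 0.5, (n_out, n_h))   # hidden -> output weights (feedback uses W2.T)
x = rng.random(n_in)
target = np.array([1.0, 0.0])
eta, gamma = 0.5, 0.5                   # learning rate and feedback strength

def settle(clamp_output):
    """Iterate to a fixed point; hidden units mix feedforward and feedback input."""
    h, y = np.zeros(n_h), np.zeros(n_out)
    for _ in range(50):
        h = sigmoid(W1 @ x + gamma * W2.T @ y)
        y = target if clamp_output else sigmoid(W2 @ h)
    return h, y

h_free, y_free = settle(clamp_output=False)    # free phase
h_clamp, y_clamp = settle(clamp_output=True)   # clamped phase

# Eq. 2.15: Hebbian term from the clamped phase, anti-Hebbian term from the free phase
W2 += eta * (np.outer(y_clamp, h_clamp) - np.outer(y_free, h_free))
W1 += eta * (np.outer(h_clamp, x) - np.outer(h_free, x))

print(y_free, settle(clamp_output=False)[1])  # free-phase output moves toward the target
```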

2.6.2 Equilibrium propagation

Scellier and Bengio [41] present an energy-based learning rule for neural networks that approximates backpropagation of error. In addition, this learning rule has been shown to be compatible with non-symmetric weights [109]. As in CHL, for each training example, the network first performs a free phase, where the input is clamped and the state of the network (i.e., the activity of each unit in the network) converges to a fixed point called the free fixed point. This is followed by a weakly clamped phase, where the input is clamped and the output units are weakly clamped to move slightly closer to the target output, and the network converges to a weakly clamped fixed point. Weights in the network are then updated according to a function of the pre- and post-synaptic activities during the free and weakly clamped phases. Importantly, they prove that, under certain conditions, this weight update rule approximates gradient descent.

2.6.3 Difference target propagation

Lee et al. [110] present a method of approximating gradient descent that constructs local loss functions at each layer in order to update both feedforward and feedback weights in a way that decreases the global loss. They define a ‘target’ at layer l as:

$$\hat{y}^l = y^l + \phi(Y^{l+1} \hat{y}^{l+1}) - \phi(Y^{l+1} y^{l+1}) \qquad (2.16)$$

where $y^l$ is the output of layer l, $\hat{y}^l$ is the target for layer l, $Y^{l+1}$ is the set of feedback weights between layers l + 1 and l, and φ is a nonlinear function in the feedback path. Feedforward weights $W^l$ and feedback weights $Y^l$ are updated to minimize the local loss functions $L^l$ and $L_{\text{inv}}^l$, respectively:

$$L^l = \| \hat{y}^l - y^l \|_2^2 \qquad (2.17)$$

$$L_{\text{inv}}^l = \| \phi(Y^l \sigma(W^l (y^{l-1} + \epsilon))) - (y^{l-1} + \epsilon) \|_2^2 \qquad (2.18)$$

where $W^l$ are the feedforward weights of layer l, σ is the nonlinear activation function in the feedforward path, and $\epsilon$ is a noise term. They demonstrate that this model can match the performance of backprop on the MNIST classification task.
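A sketch of the quantities in Eqs. 2.16-2.18 for one pair of adjacent layers, assuming tanh for both σ and φ and a perturbed output standing in for the downstream target; it computes the layer target and both local losses for a single sample.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = sigma = np.tanh   # feedback and feedforward nonlinearities (assumed tanh here)

n_prev, n_l, n_next = 6, 5, 4
W_l = rng.normal(0, 0.5, (n_l, n_prev))      # feedforward weights into layer l
W_next = rng.normal(0, 0.5, (n_next, n_l))   # feedforward weights into layer l+1
Y_l = rng.normal(0, 0.5, (n_prev, n_l))      # feedback weights from layer l to l-1
Y_next = rng.normal(0, 0.5, (n_l, n_next))   # feedback weights from layer l+1 to l

y_prev = rng.normal(size=n_prev)
y_l = sigma(W_l @ y_prev)
y_next = sigma(W_next @ y_l)
target_next = y_next - 0.1 * rng.normal(size=n_next)   # stand-in downstream target

# Eq. 2.16: the difference correction cancels the feedback reconstruction error
target_l = y_l + phi(Y_next @ target_next) - phi(Y_next @ y_next)

# Eq. 2.17: local loss pushing layer l's output toward its target
L_l = np.sum((target_l - y_l) ** 2)

# Eq. 2.18: inverse loss training the feedback path to invert the feedforward path
noise = 0.01 * rng.normal(size=n_prev)
L_inv = np.sum((phi(Y_l @ sigma(W_l @ (y_prev + noise))) - (y_prev + noise)) ** 2)

print(L_l, L_inv)
```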

2.6.4 Dendritic prediction learning

Urbanczik and Senn [111] present a two-compartment dynamic model of a pyramidal neuron whose dendritic compartment can learn to reproduce a target signal injected into the somatic compartment. This is accomplished using a three-factor plasticity rule incorporating the pre- and post-synaptic activity as well as the dendritic membrane potential. The importance of dendritic potential as a third factor in plasticity has been suggested by several experimental studies [112, 113]. Inputs at the dendritic compartment with weights w produce voltage $V_w(t)$, while the voltage at the soma, U, is driven by both the dendritic voltage and external nudging conductances. The weight of a dendritic synapse i is updated using the plasticity induction variable $PI_i(t)$, given by:

$$PI_i(t) = \left[S(t) - \phi(V^*_w(t))\right] h\!\left(V^*_w(t)\right) PSP_i(t) \tag{2.19}$$

where $S$ is the spiking output of the neuron, $V^*_w$ is the "prediction" of the somatic voltage in the absence of any nudging conductances (a scaled version of the dendritic membrane potential), $\phi$ is a nonlinear function representing the firing rate, $PSP_i(t)$ is the post-synaptic potential at synapse $i$, and $h$ is a nonlinear function. The term $S(t) - \phi(V^*_w(t))$ is the dendritic prediction error, and reflects the difference between the driving input from the dendrites and the actual output of the neuron. The plasticity rule minimizes this error, allowing the dendrites to reproduce the nudged output of the neuron in the absence of the nudging conductances. The voltage and conductance dynamics used in this model formed the basis for the dynamics used to model the spiking neurons in Chapter 3.
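A minimal sketch of the induction variable in equation (2.19) might look as follows, assuming the somatic spike train, dendritic prediction and per-synapse PSPs are available at each time step (function and argument names are illustrative):

    def plasticity_induction(spike, V_w_star, psp, phi, h):
        # spike: somatic output S(t) at this time step (0 or 1)
        # V_w_star: dendritic prediction of the somatic voltage, V*_w(t)
        # psp: array of post-synaptic potentials PSP_i(t), one per synapse
        # phi: voltage-to-rate nonlinearity; h: weighting nonlinearity
        error = spike - phi(V_w_star)      # dendritic prediction error
        return error * h(V_w_star) * psp   # equation (2.19), one value per synapse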

2.6.5 Dendritic error backpropagation

Sacramento et al. [114] present a neural network comprised of three-compartment pyramidal neurons (basal dendrites, apical dendrites, and soma), with dynamic voltages and firing rates, as well as two-compartment interneurons, analogous to SST neurons, that receive both lateral and top-down input and project to apical dendrites of pyramidal neurons. Similarly to Urbanczik and Senn [111], feedforward weight updates are based on the dendritic prediction error:

$$\frac{d}{dt} w^l_{ij} \propto \left(\phi(u^l_i(t)) - \phi(\hat{v}^l_i(t))\right) r^{l-1}_j(t) \tag{2.20}$$

where $w^l_{ij}$ is the feedforward weight between presynaptic neuron $j$ in layer $l-1$ and postsynaptic neuron $i$ in layer $l$, $\phi$ is a nonlinear function, $u^l_i$ is the somatic membrane potential of the postsynaptic neuron, $\hat{v}^l_i$ is the "dendritic prediction" of the somatic voltage (a scaled version of the dendritic voltage), and $r^{l-1}_j(t)$ is the instantaneous firing rate of the presynaptic neuron. Notably, weights in the network are updated continuously, without any phases of learning. Inhibitory neurons continuously learn to cancel the top-down feedback at apical dendrites of pyramidal neurons, such that apical membrane potentials encode backpropagated errors. The authors prove that this model approximates gradient descent, and demonstrate that a one-hidden-layer network with fixed feedback weights performs well at the MNIST classification task, performing comparably to backpropagation of error. In addition, a learning rule for feedback weights, similar to that used in difference target propagation, is presented.
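In vectorized form, one continuous-time update of equation (2.20) for a whole layer could be sketched as below, using a simple Euler step of size dt; all names are our own illustrative choices:

    import numpy as np

    def dendritic_error_step(W, u, v_hat, r_pre, phi, lr, dt):
        # Mismatch between the somatic rate phi(u) and the rate predicted from
        # the basal dendrite, phi(v_hat), gates a Hebbian term with the
        # presynaptic rates r_pre (equation 2.20); applied continuously,
        # here discretized as a single Euler step.
        dW = np.outer(phi(u) - phi(v_hat), r_pre)
        return W + lr * dt * dW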

2.6.6 Updated random feedback

Yali Amit [101] presents a model for biologically plausible gradient descent in CNNs that is demonstrated to perform similarly to backprop on challenging visual classification tasks. Notably, the author utilizes a Hebbian weight update rule:

$$\Delta W^l_{ij} \propto \delta^l_i x^{l-1}_j \tag{2.21}$$

where $\delta^l_i$ is the error signal for unit $i$ in layer $l$, and $x^{l-1}_j$ is the activity of unit $j$ in layer $l-1$. Importantly, this equation implies that the activity of the post-synaptic unit during plasticity reflects the error signal $\delta^l_i$. The model also incorporates a symmetric weight update rule for feedback weights that causes them to become correlated with feedforward weights over time. Quite simply, feedback weights are initialized randomly, but receive the same weight updates as the reciprocal feedforward weights. This is equivalent to the Kolen-Pollack weight update rule without the weight decay terms. In addition, the author introduces a more biologically plausible alternative to the softmax activation function (that is commonly used in machine learning models) for output layer units, which does not require non-local information. Finally, the author modifies convolutional layers to use locally-connected weights rather than shared weights, in order to improve biological realism, and demonstrates that networks with locally-connected weights achieve comparable performance to standard convolutional networks.
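The symmetric feedback update is simple enough to state in a few lines. The sketch below (names are illustrative) applies the same update to the feedforward matrix and, transposed, to its reciprocal feedback matrix:

    def symmetric_feedback_step(W, Y, dW, lr=0.01):
        # The feedback matrix Y receives the same update as the reciprocal
        # feedforward matrix W (transposed, since Y carries signals in the
        # reverse direction). Over training, W and Y become correlated even
        # though Y was initialized randomly; this is the Kolen-Pollack rule
        # with the weight decay terms removed.
        W = W + lr * dW
        Y = Y + lr * dW.T
        return W, Y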

2.6.7 Burst ensemble multiplexing

Using a two-compartment model of pyramidal neurons, Naud and Sprekeler [115] demonstrate that ensembles of pyramidal neurons are capable of simultaneously encoding two distinct signals in independent features of the ensemble activity. When injecting an ensemble of pyramidal neurons with two distinct currents, one at the soma, and the other at apical dendrites, the instantaneous event rate of the ensemble (where an event is defined as either a single spike or a burst of spikes) reflects the somatic input, while the burst probability of the ensemble (defined as the fraction of events that are bursts) reflects the apical input. The burst rate of the ensemble, defined as the product of the event rate and burst probability, reflects a mixture of the two inputs. Moreover, the event rate of an ensemble can be decoded by downstream neurons using STD synapses, while the burst rate can be decoded using STF synapses. The burst probability can also be decoded using STF synapses combined with disynaptic inhibition from STD synapses. Finally, the authors propose that these mechanisms can allow ensembles of pyramidal neurons to simultaneously communicate bottom-up sensory input and top-down feedback for plasticity through a multi-layer hierarchical network. This work is the basis for Chapter 4, which demonstrates that ensemble multiplexing using bursts can allow for efficient gradient descent in a deep neural network.
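As a rough illustration of these three ensemble-level quantities, the sketch below computes the event rate, burst probability and burst rate from binned event and burst indicators across an ensemble; the binning scheme and all names are assumptions made here for illustration:

    import numpy as np

    def ensemble_signals(events, bursts, dt):
        # events: (neurons x bins) boolean array, True where an event
        #         (a single spike or a burst) begins
        # bursts: same shape, True where that event is a burst
        # dt: bin width in seconds
        n_events = events.sum(axis=0).astype(float)
        event_rate = n_events / (events.shape[0] * dt)   # tracks somatic input
        burst_prob = np.divide(bursts.sum(axis=0), n_events,
                               out=np.zeros_like(n_events), where=n_events > 0)
        burst_rate = event_rate * burst_prob             # mixture of both inputs
        return event_rate, burst_prob, burst_rate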

2.7 Project synopses

The following three chapters are comprised of manuscripts describing the three projects that were completed as part of this thesis. Project 1 (Chapter 3) was peer-reviewed and published in eLife [43]. Project 2 (Chapter 4) was peer-reviewed and accepted for publication at Nature Neuroscience, and is available as a preprint [44]. Project 3 (Chapter 5) was peer-reviewed and published as a conference paper at the 2020 International Conference on Learning Representations (ICLR) [45]. Below are brief summaries describing the methodology and results of each project.

2.7.1 Project 1: Towards deep learning with segregated dendrites

The first project presents a biologically plausible computational model of credit assignment in multi-layer neural networks using neurons with electrotonically segregated dendrites. The model uses multi-compartment spiking neurons (inspired by the model developed by Urbanczik and Senn [111]) with segregated dendritic compartments, voltage dynamics and conductances. Neurons receive feedforward input at the basal dendrites and feedback signals from the downstream layer at the apical dendrites. The network is trained using two phases: a forward phase, where input propagates through the network until a steady state is reached, and a target phase, where a nudging target signal is introduced at the output layer that pushes the activity of output layer neurons towards the target. During each phase, apical dendrites are segregated from the soma and therefore do not affect the activity of the neurons. At the end of each phase, a nonlinear function of the average apical potential, representing a Ca2+ plateau potential, is communicated to the soma. The difference between the two plateau potentials in the forward and target phases is used to update the basal weights. The model uses fixed, random feedback weights, taking advantage of the feedback alignment effect. A multi-layer network is able to approximate gradient descent and take advantage of additional layers to achieve good performance on the MNIST classification task.

2.7.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits

The second project presents a novel theory (called “burstprop”) of how networks of ensembles of neurons can coordinate gradient descent using burst multiplexing. This project builds on the work by Naud and Sprekeler [115], which used simulations to demonstrate that pyramidal neuron ensembles can encode two distinct information streams in their event rates and burst probabilities, and can decode event rates and burst rates using STD or STF synapses. The present work presents a three-factor learning rule for feedforward synapses in a network of neuron ensembles that incorporates the temporal difference in burst rate, the presynaptic event rate and a neuromodulatory signal that gates plasticity. A mathematical analysis shows that this learning rule approximates gradient descent. Using biophysically realistic models of spiking neurons, this work demonstrates that a multi-layer network of neuron ensembles can learn to solve the exclusive-or (XOR) problem, a task that requires proper credit assignment throughout a multi-layer network. Furthermore, a simplified rate model of deep convolutional neural networks using the proposed learning rule is trained on the MNIST, CIFAR-10 and ImageNet image classification tasks, which are standard benchmarks used in machine learning. The burstprop learning rule for feedforward weights, combined with a learning rule for feedback weights based on the Kolen-Pollack algorithm, is able to match backprop performance on MNIST and CIFAR-10 and achieve good performance on ImageNet, demonstrating that burstprop enables gradient descent in deep neural networks.

2.7.3 Project 3: Spike-based causal inference for weight alignment

The third project presents a novel method for synapses receiving feedback signals to become symmetric with feedforward synapses by leveraging the spiking behavior of neurons. This work is based on that of Lansdell and Kording [116], who demonstrate that neurons in a network can estimate their causal effect on a downstream reward signal using a technique from regression discontinuity design (RDD). This technique estimates the causal effect of any binary treatment on an outcome by comparing the value of the outcome just below and just above the threshold that determines the treatment. The authors apply this to the case of neurons, where the binary treatment is the spiking of the neuron, and the outcome is the reward signal. We extend this result to demonstrate that a synapse can obtain an unbiased estimate of the causal effect of the neuron's activity on the spiking input that it receives, and that this causal estimate can be translated into a plasticity rule for feedback synapses in a network that enables weight symmetry. Furthermore, we demonstrate that a network trained with this learning rule for feedback weights is able to match the performance of backprop on the Fashion-MNIST, SVHN, CIFAR-10 and VOC image classification tasks.

Chapter 3

Project 1: Towards deep learning with segregated dendrites

Jordan Guerguiev1,2, Timothy P. Lillicrap4, and Blake A. Richards1,2,3,*

1 Department of Biological Sciences, University of Toronto Scarborough, Toronto, ON, Canada 2 Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada 3 Learning in Machines and Brains Program, Canadian Institute for Advanced Research, Toronto, ON, Canada 4 DeepMind, London, UK * Corresponding author, email: [email protected]

This chapter was originally published as a manuscript at eLife [43].

3.1 Abstract

Deep learning has led to significant advances in artificial intelligence, in part, by adopting strategies motivated by neurophysiology. However, it is unclear whether deep learning could occur in the real brain. Here, we show that a deep learning algorithm that utilizes multi-compartment neurons might help us to understand how the neocortex optimizes cost functions. Like neocortical pyramidal neurons, neurons in our model receive sensory information and higher-order feedback in electrotonically segregated compartments. Thanks to this segregation, neurons in different layers of the network can coordinate synaptic weight updates. As a result, the network learns to categorize images better than a single layer network. Furthermore, we show that our algorithm takes advantage of multilayer architectures to identify useful higher-order representations—the hallmark of deep learning. This work demonstrates that deep learning can be achieved using segregated dendritic compartments, which may help to explain the morphology of neocortical pyramidal neurons.


3.2 Author contributions

Jordan Guerguiev (JG) developed the computational model and performed all experiments. He also contributed to writing the manuscript. Timothy P. Lillicrap served as an advisor for the development of the model, and assisted with writing the manuscript. Blake A. Richards is the thesis supervisor for JG, and provided guidance and support with the development of the model and selection of experiments to perform. He also contributed to the writing of the manuscript.

3.3 Introduction

Deep learning refers to an approach in artificial intelligence (AI) that utilizes neural networks with multiple layers of processing units. Importantly, deep learning algorithms are designed to take advantage of these multi-layer network architectures in order to generate hierarchical representations wherein each successive layer identifies increasingly abstract, relevant variables for a given task [117, 118]. In recent years, deep learning has revolutionized machine learning, opening the door to AI applications that can rival human capabilities in pattern recognition and control [119, 120, 121]. Interestingly, the representations that deep learning generates resemble those observed in the neocortex [36, 34, 35], suggesting that something akin to deep learning is occurring in the mammalian brain [32, 122].

Yet, a large gap exists between deep learning in AI and our current understanding of learning and memory in neuroscience. In particular, unlike deep learning researchers, neuroscientists do not yet have a solution to the "credit assignment problem" [123, 46, 37]. Learning to optimize some behavioral or cognitive function requires a method for assigning "credit" (or "blame") to neurons for their contribution to the final behavioral output [118, 37]. The credit assignment problem refers to the fact that assigning credit in multi-layer networks is difficult, since the behavioral impact of neurons in early layers of a network depends on the downstream synaptic connections. For example, consider the behavioral effects of synaptic changes, i.e. long-term potentiation/depression (LTP/LTD), occurring between different sensory circuits of the brain. Exactly how these synaptic changes will impact behavior and cognition depends on the downstream connections between the sensory circuits and motor or associative circuits (Figure 3.1A). If a learning algorithm can solve the credit assignment problem then it can take advantage of multi-layer architectures to develop complex behaviors that are applicable to real-world problems [117]. Despite its importance for real-world learning, the credit assignment problem, at the synaptic level, has received little attention in neuroscience.

The lack of attention to credit assignment in neuroscience is, arguably, a function of the history of biological studies of synaptic plasticity. Due to the well-established dependence of LTP and LTD on presynaptic and postsynaptic activity, current theories of learning in neuroscience tend to emphasize Hebbian learning algorithms [85, 124], that is, learning algorithms where synaptic changes depend solely on presynaptic and postsynaptic activity. Hebbian learning models can produce representations that resemble the representations in the real brain [125, 126] and they are backed up by decades of experimental findings [3, 85, 124]. But, current Hebbian learning algorithms do not solve the credit assignment problem, nor do global neuromodulatory signals used in reinforcement learning [46]. As a result, deep learning algorithms from AI that can perform multi-layer credit assignment outperform existing Hebbian models of sensory learning on a variety of tasks [32, 34]. This suggests that a critical, missing component in our current models of the neurobiology of learning and memory is an explanation of how the brain solves the credit assignment problem.

Figure 3.1: The credit assignment problem in multi-layer neural networks. (A) Illustration of the credit assignment problem. In order to take full advantage of the multi-circuit architecture of the neocortex when learning, synapses in earlier processing stages (blue connections) must somehow receive "credit" for their impact on behavior or cognition. However, the credit due to any given synapse early in a processing pathway depends on the downstream synaptic connections that link the early pathway to later computations (red connections). (B) Illustration of weight transport in backpropagation. To solve the credit assignment problem, the backpropagation of error algorithm explicitly calculates the credit due to each synapse in the hidden layer by using the downstream synaptic weights when calculating the hidden layer weight changes. This solution works well in AI applications, but is unlikely to occur in the real brain.

However, the most common solution to the credit assignment problem in AI is to use the backpropagation of error algorithm [123]. Backpropagation assigns credit by explicitly using current downstream synaptic connections to calculate synaptic weight updates in earlier layers, commonly termed "hidden layers" [118] (Figure 3.1B). This technique, which is sometimes referred to as "weight transport", involves non-local transmission of synaptic weight information between layers of the network [46, 127]. Weight transport is clearly unrealistic from a biological perspective [37, 38]. It would require early sensory processing areas (e.g. V1, V2, V4) to have precise information about billions of synaptic connections in downstream circuits (MT, IT, M2, EC, etc.). According to our current understanding, there is no physiological mechanism that could communicate this information in the brain. Some deep learning algorithms utilize purely Hebbian rules [41, 128]. But, they depend on feedback synapses that are symmetric to feedforward synapses [41, 128], which is essentially a version of weight transport. Altogether, these artificial aspects of current deep learning solutions to credit assignment have rendered many scientists skeptical of the proposal that deep learning occurs in the real brain [38, 127, 129, 130].

Recent findings have shown that these problems may be surmountable, though. [46], [110] and [104] have demonstrated that it is possible to solve the credit assignment problem even while avoiding weight transport or symmetric feedback weights. The key to these learning algorithms is the use of feedback signals that convey enough information about credit to calculate local error signals in hidden layers [110, 46, 104]. With this approach it is possible to take advantage of multi-layer architectures, leading to performance that rivals backpropagation [110, 46, 104]. Hence, this work has provided a significant breakthrough in our understanding of how the real brain might do credit assignment.

Nonetheless, the models of [46], [110] and [104] involve some problematic assumptions. Specifically, although it is not directly stated in all of the papers, there is an implicit assumption that there is a separate feedback pathway for transmitting the information that determines the local error signals (Figure 3.2A). Such a pathway is required in these models because the error signal in the hidden layers depends on the difference between feedback that is generated in response to a purely feedforward propagation of sensory information, and feedback that is guided by a teaching signal [46, 110, 104]. In order to calculate this difference, sensory information must be transmitted separately from the feedback signals that are used to drive learning. In single compartment neurons, keeping feedforward sensory information separate from feedback signals is impossible without a separate pathway. At face value, such a pathway is possible. But, closer inspection uncovers a couple of difficulties with such a proposal.

Figure 3.2: Potential solutions to credit assignment using top-down feedback. (A) Illustration of the implicit feedback pathway used in previous models of deep learning. In order to assign credit, feedforward information must be integrated separately from any feedback signals used to calculate error for synaptic updates (the error is indicated here with δ). (B) Illustration of the segregated dendrites proposal. Rather than using a separate pathway to calculate error based on feedback, segregated dendritic compartments could receive feedback and calculate the error signals locally.

First, the error signals that solve the credit assignment problem are not global error signals (like neuromodulatory signals used in reinforcement learning). Rather, they are cell-by-cell error signals. This would mean that the feedback pathway would require some degree of pairing, wherein each neuron in the hidden layer is paired with a feedback neuron (or circuit). That is not impossible, but there is no evidence to date of such an architecture in the neocortex. Second, the error signal in the hidden layer is signed (i.e. it can be positive or negative), and the sign determines whether LTP or LTD occur in the hidden layer neurons [110, 46, 104]. Communicating signed signals with a spiking neuron can theoretically be done by using a baseline firing rate that the neuron can go above (for positive signals) or below (for negative signals). But, in practice, such systems are difficult to operate with a single neuron, because as the error gets closer to zero any noise in the spiking of the neuron can switch the sign of the signal, which switches LTP to LTD, or vice versa. This means that as learning progresses the neuron's ability to communicate error signs gets worse. It would be possible to overcome this by using many neurons to communicate an error signal, but this would then require many error neurons for each hidden layer neuron, which would lead to a very inefficient means of communicating errors. Therefore, the real brain's specific solution to the credit assignment problem is unlikely to involve a separate feedback pathway for cell-by-cell, signed signals to instruct plasticity.

However, segregating the integration of feedforward and feedback signals does not require a separate pathway if neurons have more complicated morphologies than the point neurons typically used in artificial neural networks. Taking inspiration from biology, we note that real neurons are much more complex than single compartments, and different signals can be integrated at distinct dendritic locations. Indeed, in the primary sensory areas of the neocortex, feedback from higher-order areas arrives in the distal apical dendrites of pyramidal neurons [64, 131, 132], which are electrotonically very distant from the basal dendrites where feedforward sensory information is received [60, 133, 61]. Thus, as has been noted by previous authors [134, 132, 135], the anatomy of pyramidal neurons may actually provide the segregation of feedforward and feedback information required to calculate local error signals and perform credit assignment in biological neural networks.

Here, we show how deep learning can be implemented if neurons in hidden layers contain segregated "basal" and "apical" dendritic compartments for integrating feedforward and feedback signals separately (Figure 3.2B). Our model builds on previous neural networks research [110, 46] as well as computational studies of supervised learning in multi-compartment neurons [111, 134, 135]. Importantly, we use the distinct basal and apical compartments in our neurons to integrate feedback signals separately from feedforward signals. With this, we build a local error signal for each hidden layer that ensures appropriate credit assignment. We demonstrate that even with random synaptic weights for feedback into the apical compartment, our algorithm can coordinate learning to achieve classification of the MNIST database of hand-written digits that is better than that which can be achieved with a single layer network. Furthermore, we show that our algorithm allows the network to take advantage of multi-layer structures to build hierarchical, abstract representations, one of the hallmarks of deep learning [118]. Our results demonstrate that deep learning can be implemented in a biologically feasible manner if feedforward and feedback signals are received at electrotonically segregated dendrites, as is the case in the mammalian neocortex.

3.4 Results

3.4.1 A network architecture with segregated dendritic compartments

Deep supervised learning with local weight updates requires that each neuron receive signals that can be used to determine its "credit" for the final behavioral output. We explored the idea that the cortico-cortical feedback signals to pyramidal cells could provide the required information for credit assignment. In particular, we were inspired by four observations from both machine learning and biology:

1. Current solutions to credit assignment without weight transport require segregated feedforward and feedback signals [110, 46].

2. In the neocortex, feedforward sensory information and higher-order cortico-cortical feedback are largely received by distinct dendritic compartments, namely the basal dendrites and distal apical dendrites, respectively [132, 131].

3. The distal apical dendrites of pyramidal neurons are electrotonically distant from the soma, and apical communication to the soma depends on active propagation through the apical dendritic shaft, which is predominantly driven by voltage-gated calcium channels. Due to the dynamics of voltage-gated calcium channels, these non-linear, active events in the apical shaft generate prolonged upswings in the membrane potential, known as "plateau potentials", which can drive burst firing at the soma [60, 61].

4. Plateau potentials driven by apical activity can guide plasticity in pyramidal neurons in vivo [136, 137].

With these considerations in mind, we hypothesized that the computations required for credit assignment could be achieved without separate pathways for feedback signals. Instead, they could be achieved by having two distinct dendritic compartments in each hidden layer neuron: a "basal" compartment, strongly coupled to the soma for integrating bottom-up sensory information, and an "apical" compartment for integrating top-down feedback in order to calculate credit assignment signals and drive synaptic plasticity via "plateau potentials" [136, 137] (Figure 3.3A).

Figure 3.3: Illustration of a multi-compartment neural network model for deep learning. (A) Left: Reconstruction of a real pyramidal neuron from layer 5 mouse primary visual cortex. Right: Illustration of our simplified pyramidal neuron model. The model consists of a somatic compartment, plus two distinct dendritic compartments (apical and basal). As in real pyramidal neurons, top-down inputs project to the apical compartment while bottom-up inputs project to the basal compartment. (B) Diagram of network architecture. An image is used to drive spiking input units which project to the hidden layer basal compartments through weights $W^0$. Hidden layer somata project to the output layer dendritic compartment through weights $W^1$. Feedback from the output layer somata is sent back to the hidden layer apical compartments through weights $Y$. The variables for the voltages in each of the compartments are shown. The number of neurons used in each layer is shown in gray. (C) Illustration of transmit vs. plateau calculations. Left: In the transmit mode the network dynamics are updated at each time-step, and the apical dendrite is segregated by a low value for $g_a$, making the network effectively feed-forward. Here, the voltages of each of the compartments are shown for one run of the network. The spiking output of the soma is also shown. Note that the somatic voltage and spiking track the basal voltage, and ignore the apical voltage. However, the apical dendrite does receive feedback, and this is used to drive its voltage. After a period of $\Delta t_s$ to allow for settling of the dynamics, the average apical voltage is calculated (shown here as a blue line). Right: The average apical voltage is then used to calculate an apical plateau potential, which is equal to the nonlinearity $\sigma(\cdot)$ applied to the average apical voltage.

As an initial test of this concept we built a network with a single hidden layer. Although this network is not very "deep", even a single hidden layer can improve performance over a one-layer architecture if the learning algorithm solves the credit assignment problem [117, 46]. Hence, we wanted to initially determine whether our network could take advantage of a hidden layer to reduce error at the output layer. The network architecture is illustrated in Figure 3.3B. An image from the MNIST data set is used to set the spike rates of $\ell = 784$ Poisson point-process neurons in the input layer (one neuron per image pixel, rates-of-fire determined by pixel intensity). These project to a hidden layer with $m = 500$ neurons. The neurons in the hidden layer (which we index with a '0') are composed of three distinct compartments with their own voltages: the apical compartments (with voltages described by the vector $\mathbf{V}^{0a}(t) = [V^{0a}_1(t), ..., V^{0a}_m(t)]$), the basal compartments (with voltages $\mathbf{V}^{0b}(t) = [V^{0b}_1(t), ..., V^{0b}_m(t)]$), and the somatic compartments (with voltages $\mathbf{V}^0(t) = [V^0_1(t), ..., V^0_m(t)]$). (Note: for notational clarity, all vectors and matrices in the paper are in boldface.) The voltage of the $i$th neuron in the hidden layer is updated according to:

$$\tau \frac{dV^0_i(t)}{dt} = -V^0_i(t) + \frac{g_b}{g_l}\left(V^{0b}_i(t) - V^0_i(t)\right) + \frac{g_a}{g_l}\left(V^{0a}_i(t) - V^0_i(t)\right) \tag{3.1}$$

where $g_l$, $g_b$ and $g_a$ represent the leak conductance, the conductance from the basal dendrites, and the conductance from the apical dendrites, respectively, and $\tau = C_m/g_l$ where $C_m$ is the membrane capacitance (see Methods, equation (3.16)). For mathematical simplicity we assume in our simulations a resting membrane potential of 0 mV (this value does not affect the results). We implement electrotonic segregation in the model by altering the $g_a$ value: low values for $g_a$ lead to electrotonically segregated apical dendrites. In the initial set of simulations we set $g_a = 0$, which effectively makes it a feed-forward network, but we relax this condition in later simulations. We treat the voltages in the dendritic compartments simply as weighted sums of the incoming spike trains. Hence, for the $i$th hidden layer neuron:

$$V^{0b}_i(t) = \sum_{j=1}^{\ell} W^0_{ij} s^{input}_j(t) + b^0_i$$
$$V^{0a}_i(t) = \sum_{j=1}^{n} Y_{ij} s^1_j(t) \tag{3.2}$$

where $W^0_{ij}$ and $Y_{ij}$ are synaptic weights from the input layer and the output layer, respectively, $b^0_i$ is a bias term, and $\mathbf{s}^{input}$ and $\mathbf{s}^1$ are the filtered spike trains of the input layer and output layer neurons, respectively. (Note: the spike trains are convolved with an exponential kernel to mimic postsynaptic potentials, see Methods equation (3.11).) The somatic compartments generate spikes using Poisson processes. The instantaneous rates of these processes are described by the vector $\boldsymbol{\phi}^0(t) = [\phi^0_1(t), ..., \phi^0_m(t)]$, which is in units of spikes/s or Hz. These rates-of-fire are determined by a non-linear sigmoid function, $\sigma(\cdot)$, applied to the somatic voltages, i.e. for the $i$th hidden layer neuron:

$$\phi^0_i(t) = \phi_{max}\,\sigma(V^0_i(t)) = \phi_{max}\,\frac{1}{1 + e^{-V^0_i(t)}} \tag{3.3}$$

where $\phi_{max}$ is the maximum rate-of-fire for the neurons.
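As a sketch of how equation (3.3) can drive spiking in discrete time, the snippet below converts somatic voltages to sigmoidal rates and samples spikes from a Bernoulli approximation to a Poisson process; this is a standard discretization written under our own naming assumptions, not the exact code used in the thesis:

    import numpy as np

    def sample_somatic_spikes(V, phi_max, dt, rng):
        # Equation (3.3): somatic voltage -> firing rate through a sigmoid,
        # then Bernoulli approximation to a Poisson process over a time bin
        # of width dt (in seconds).
        rate = phi_max / (1.0 + np.exp(-V))     # spikes/s
        return rng.random(V.shape) < rate * dt  # True where a spike occurs

    # e.g. spikes = sample_somatic_spikes(V, 200.0, 1e-3, np.random.default_rng(0))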

The output layer (which we index here with a '1') contains $n = 10$ two-compartment neurons (one for each image category), similar to those used in a previous model of dendritic prediction learning [111]. The output layer dendritic voltages ($\mathbf{V}^{1b}(t) = [V^{1b}_1(t), ..., V^{1b}_n(t)]$) and somatic voltages ($\mathbf{V}^1(t) = [V^1_1(t), ..., V^1_n(t)]$) are updated in a similar manner to the hidden layer basal compartment and soma:

$$\tau \frac{dV^1_i(t)}{dt} = -V^1_i(t) + \frac{g_d}{g_l}\left(V^{1b}_i(t) - V^1_i(t)\right) + I_i(t)$$
$$V^{1b}_i(t) = \sum_{j=1}^{m} W^1_{ij} s^0_j(t) + b^1_i \tag{3.4}$$

where $W^1_{ij}$ are synaptic weights from the hidden layer, $\mathbf{s}^0$ are the filtered spike trains of the hidden layer neurons (see equation (3.11)), $g_l$ is the leak conductance, $g_d$ is the conductance from the dendrites, and $\tau$ is given by equation (3.16). In addition to the absence of an apical compartment, the other salient difference between the output layer neurons and the hidden layer neurons is the presence of the term $I_i(t)$, which is a teaching signal that can be used to force the output layer to the correct answer. Whether any such teaching signals exist in the real brain is unknown, though there is evidence that animals can represent desired behavioral outputs with internal goal representations [138]. (See below, and Methods, equations (3.19) and (3.20) for more details on the teaching signal.)

In our model, there are two different types of computation that occur in the hidden layer neurons: "transmit" and "plateau". The transmit computations are standard numerical integration of the simulation, with voltages evolving according to equation (3.1), and with the apical compartment electrotonically segregated from the soma (depending on $g_a$) (Figure 3.3C, left). In contrast, the plateau computations do not involve numerical integration with equation (3.1). Instead, the apical voltage is averaged over the most recent 20-30 ms period and the sigmoid non-linearity is applied to it, giving us "plateau potentials" in the hidden layer neurons (we indicate plateau potentials with $\alpha$, see equation (3.5) below, and Figure 3.3C, right). The intention behind this design was to mimic the non-linear transmission from the apical dendrites to the soma that occurs during a plateau potential driven by calcium spikes in the apical dendritic shaft [60], but in the simplest, most abstract formulation possible.
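For intuition, one transmit step, i.e. a forward-Euler integration step of equation (3.1), can be sketched as follows; the function and parameter names are illustrative rather than the thesis code:

    import numpy as np

    def transmit_step(V, V_b, V_a, g_l, g_b, g_a, C_m, dt):
        # One Euler step of equation (3.1) for a vector of hidden layer
        # somatic voltages; tau = C_m / g_l is the membrane time constant,
        # and setting g_a = 0 segregates the apical compartment.
        tau = C_m / g_l
        dV = (-V + (g_b / g_l) * (V_b - V) + (g_a / g_l) * (V_a - V)) / tau
        return V + dt * dV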

Importantly, plateau potentials in our simulations are single numeric values (one per hidden layer neuron) that can be used for credit assignment. We do not use them to alter the network dynamics. When they occur, they are calculated, transmitted to the basal dendrite instantaneously, and then stored temporarily (on the order of 0-60 ms) for calculating synaptic weight updates.

3.4.2 Calculating credit assignment signals with feedback-driven plateau potentials

To train the network we alternate between two phases. First, during the "forward" phase we present an image to the input layer without any teaching current at the output layer ($I_i(t) = 0, \forall i$). The forward phase occurs between times $t_0$ and $t_1$. At $t_1$ a plateau potential is calculated in all the hidden layer neurons ($\boldsymbol{\alpha}^f = [\alpha^f_1, ..., \alpha^f_m]$) and the "target" phase begins. During this phase, which lasts until $t_2$, the image continues to drive the input layer, but now the output layer also receives teaching current. The teaching current forces the correct output neuron to its max firing rate and all the others to silence.

For example, if an image of a '9' is presented, then over the time period $t_1$-$t_2$ the '9' neuron in the output layer fires at max, while the other neurons are silent (Figure 3.4A). At $t_2$ another set of plateau potentials ($\boldsymbol{\alpha}^t = [\alpha^t_1, ..., \alpha^t_m]$) are calculated in the hidden layer neurons. The result is that we have plateau potentials in the hidden layer neurons for both the end of the forward phase ($\boldsymbol{\alpha}^f$) and the end of the target phase ($\boldsymbol{\alpha}^t$), which are calculated as:

$$\alpha^f_i = \sigma\!\left(\frac{1}{\Delta t_1}\int_{t_1-\Delta t_1}^{t_1} V^{0a}_i(t)\,dt\right)$$
$$\alpha^t_i = \sigma\!\left(\frac{1}{\Delta t_2}\int_{t_2-\Delta t_2}^{t_2} V^{0a}_i(t)\,dt\right) \tag{3.5}$$

where $\Delta t_s$ is a time delay used to allow the network dynamics to settle before integrating the plateau, and $\Delta t_i = t_i - (t_{i-1} + \Delta t_s)$ (see Methods, equation (3.22) and Figure 3.4A).

Similar to how targets are used in deep supervised learning [118], the goal of learning in our network is to make the network dynamics during the forward phase converge to the same output activity pattern as exists in the target phase. Put another way, in the absence of the teaching signal, we want the activity at the output layer to be the same as that which would exist with the teaching signal, so that the network can give appropriate outputs without any guidance. To do this, we initialize all the weight matrices with random weights, then we train the weight matrices $W^0$ and $W^1$ using stochastic gradient descent on local loss functions for the hidden and output layers, respectively (see below). These weight updates occur at the end of every target phase, i.e. the synapses are not updated during transmission. Like [46], we leave the weight matrix $Y$ fixed in its initial random configuration. When we update the synapses in the network we use the plateau potential values $\boldsymbol{\alpha}^f$ and $\boldsymbol{\alpha}^t$ to determine appropriate credit assignment (see below).

The network is simulated in near continuous-time (except that each plateau is considered to be instantaneous), and the temporal intervals between plateaus are randomly sampled from an inverse Gaussian distribution (Figure 3.4B, top). As such, the specific amount of time that the network is presented with each image and teaching signal is stochastic, though usually somewhere between 50-60 ms of simulated time (Figure 3.4B, bottom). This stochasticity was not necessary, but it demonstrates that although the system operates in phases, the specific length of the phases is not important as long as they are sufficiently long to permit integration (see Lemma A.1). In the data presented in this paper, all 60,000 images in the MNIST training set were presented to the network one at a time, and each exposure to the full set of images was considered an "epoch" of training. At the end of each epoch, the network's classification error rate on a separate set of 10,000 test images was assessed with a single forward phase for each image (see Methods). The network's classification was judged by which output neuron had the highest average firing rate during these test image forward phases.
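A minimal sketch of the plateau computation in equation (3.5), assuming the apical voltages have been recorded in discrete time bins (the array layout and names are our own choices for illustration):

    def plateau_potentials(V_apical, end_idx, window, sigma):
        # V_apical: (neurons x time bins) NumPy record of apical voltages.
        # Average the apical voltage over the last `window` bins of the phase
        # ending at bin `end_idx`, then apply the sigmoid nonlinearity,
        # giving one plateau potential per hidden neuron (equation 3.5).
        mean_apical = V_apical[:, end_idx - window:end_idx].mean(axis=1)
        return sigma(mean_apical)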

Figure 3.4: Illustration of network phases for learning. (A) Illustration of the sequence of network phases that occur for each training example. The network undergoes a forward phase where $I_i(t) = 0, \forall i$ and a target phase where $I_i(t)$ causes any given neuron $i$ to fire at max-rate or be silent, depending on whether it is the correct category of the current input image. In this illustration, an image of a '9' is being presented, so the '9' unit at the output layer is activated and the other output neurons are inhibited and silent. At the end of the forward phase the set of plateau potentials $\boldsymbol{\alpha}^f$ are calculated, and at the end of the target phase the set of plateau potentials $\boldsymbol{\alpha}^t$ are calculated. (B) Illustration of phase length sampling. Each phase length is sampled stochastically. In other words, for each training image, the lengths of forward & target phases (shown as blue bar pairs, where bar length represents phase length) are randomly drawn from a shifted inverse Gaussian distribution with a minimum of 50 ms.

It is important to note that there are many aspects of this design that are not physiologically accurate. Most notably, stochastic generation of plateau potentials across a population is not an accurate reflection of how real pyramidal neurons operate, since apical calcium spikes are determined by a number of concrete physiological factors in individual cells, including back-propagating action potentials, spike-timing and inhibitory inputs [60, 133, 61]. However, we note that calcium spikes in the apical dendrites can be prevented from occurring via the activity of distal dendrite targeting inhibitory interneurons [139], which can synchronize pyramidal activity [140]. Furthermore, distal dendrite targeting interneurons can themselves be rapidly inhibited in response to temporally precise neuromodulatory inputs [141, 142, 49, 143, 144]. Therefore, it is entirely plausible that neocortical micro-circuits would generate synchronized plateaus/bursts at punctuated periods of time in response to disinhibition of the apical dendrites governed by neuromodulatory signals that determine "phases" of processing. Alternatively, oscillations in population activity could provide a mechanism for promoting alternating phases of processing and synaptic plasticity [145]. But, complete synchrony of plateaus in our hidden layer neurons is not actually critical to our algorithm: only the temporal relationship between the plateaus and the teaching signal is critical. This relationship itself is arguably plausible given the role of neuromodulatory inputs in dis-inhibiting the distal dendrites of pyramidal neurons [49, 144]. Of course, we are engaged in a great deal of speculation here. But, the point is that our model utilizes anatomical and functional motifs that are loosely analogous to what is observed in the neocortex. Importantly for the present study, the key issue is the use of segregated dendrites which permit an effective feed-forward dynamic, punctuated by feedback-driven plateau potentials, to solve the credit assignment problem.

3.4.3 Co-ordinating optimization across layers with feedback to apical dendrites

To solve the credit assignment problem without using weight transport, we had to define local error signals, or "loss functions", for the hidden layer and output layer that somehow took into account the impact that each hidden layer neuron has on the output of the network. In other words, we only want to update a hidden layer synapse in a manner that will help us make the forward phase activity at the output layer more similar to the target phase activity. To begin, we define the target firing rates for the output neurons, $\boldsymbol{\phi}^{1*} = [\phi^{1*}_1, ..., \phi^{1*}_n]$, to be their average firing rates during the target phase:

$$\phi^{1*}_i = \bar{\phi}^{1t}_i = \frac{1}{\Delta t_2}\int_{t_1+\Delta t_s}^{t_2} \phi^1_i(t)\,dt \tag{3.6}$$

(Throughout the paper, we use $\phi^*$ to denote a target firing rate and $\bar{\phi}$ to denote a firing rate averaged over time.) We then define a loss function at the output layer using this target, by taking the difference between the average forward phase activity and the target:

$$L^1 \approx \|\boldsymbol{\phi}^{1*} - \bar{\boldsymbol{\phi}}^{1f}\|_2^2 = \|\bar{\boldsymbol{\phi}}^{1t} - \bar{\boldsymbol{\phi}}^{1f}\|_2^2 = \left\|\frac{1}{\Delta t_2}\int_{t_1+\Delta t_s}^{t_2} \boldsymbol{\phi}^1(t)\,dt - \frac{1}{\Delta t_1}\int_{t_0+\Delta t_s}^{t_1} \boldsymbol{\phi}^1(t)\,dt\right\|_2^2 \tag{3.7}$$

(Note: the true loss function we use is slightly more complex than the one formulated here, hence the $\approx$ symbol in equation (3.7), but this formulation is roughly correct and easier to interpret. See Methods, equation (3.23) for the exact formulation.) This loss function is zero only when the average firing rates of the output neurons during the forward phase equal their target, i.e. the average firing rates during the target phase. Thus, the closer $L^1$ is to zero, the more the network's output for an image matches the output activity pattern imposed by the teaching signal, $I(t)$. Effective credit assignment is achieved when changing the hidden layer synapses is guaranteed to reduce $L^1$. To obtain this guarantee, we defined a set of target firing rates for the hidden layer neurons that uses the information contained in the plateau potentials. Specifically, in a similar manner to [110], we define the target firing rates for the hidden layer neurons, $\boldsymbol{\phi}^{0*} = [\phi^{0*}_1, ..., \phi^{0*}_m]$, to be:

$$\phi^{0*}_i = \bar{\phi}^{0f}_i + \alpha^t_i - \alpha^f_i \tag{3.8}$$

where $\alpha^t_i$ and $\alpha^f_i$ are the plateaus defined in equation (3.5). As with the output layer, we define the loss function for the hidden layer to be the difference between the target firing rate and the average firing rate during the forward phase:

$$L^0 \approx \|\boldsymbol{\phi}^{0*} - \bar{\boldsymbol{\phi}}^{0f}\|_2^2 = \|\bar{\boldsymbol{\phi}}^{0f} + \boldsymbol{\alpha}^t - \boldsymbol{\alpha}^f - \bar{\boldsymbol{\phi}}^{0f}\|_2^2 = \|\boldsymbol{\alpha}^t - \boldsymbol{\alpha}^f\|_2^2 \tag{3.9}$$

(Again, note the use of the $\approx$ symbol, see equation (3.30) for the exact formulation.) This loss function is zero only when the plateau at the end of the forward phase equals the plateau at the end of the target phase. Since the plateau potentials integrate the top-down feedback (see equation (3.5)), we know that the hidden layer loss function, $L^0$, is zero if the output layer loss function, $L^1$, is zero. Moreover, we can show that these loss functions provide a broader guarantee that, under certain conditions, if $L^0$ is reduced, then on average, $L^1$ will also be reduced (see Theorem A.1). This provides our assurance of credit assignment: we know that the ultimate goal of learning (reducing $L^1$) can be achieved by updating the synaptic weights at the hidden layer to reduce the local loss function $L^0$ (Figure 3.5A). We do this using stochastic gradient descent at the end of every target phase:

$$\Delta W^1 = -\eta_1 \frac{\partial L^1}{\partial W^1}, \qquad \Delta W^0 = -\eta_0 \frac{\partial L^0}{\partial W^0} \tag{3.10}$$

where $\eta_i$ and $\Delta W^i$ refer to the learning rate and update term for weight matrix $W^i$ (see Methods, equations (3.28), (3.29), (3.33) and (3.35) for details of the weight update procedures). Performing gradient descent on $L^1$ results in a relatively straightforward delta rule update for $W^1$ (see equation (3.29)). The weight update for the hidden layer weights, $W^0$, is similar, except for the presence of the difference between the two plateau potentials, $\boldsymbol{\alpha}^t - \boldsymbol{\alpha}^f$ (see equation (3.35)). Importantly, given the way in which we defined the loss functions, as the hidden layer reduces $L^0$ by updating $W^0$, $L^1$ should also be reduced, i.e. hidden layer learning should imply output layer learning, thereby utilizing the multi-layer architecture.

To test that we were successful in credit assignment with this design, and to provide empirical support for the proof of Theorem A.1, we compared the loss function at the hidden layer, $L^0$, to the output layer loss function, $L^1$, across all of the image presentations to the network. We observed that, generally, whenever the hidden layer loss was low, the output layer loss was also low. For example, when we consider the loss for the set of '2' images presented to the network during the second epoch, there was a Pearson correlation coefficient between $L^0$ and $L^1$ of r = 0.61, which was much higher than what was observed for shuffled data, wherein output and hidden activities were randomly paired (Figure 3.5B). Furthermore, these correlations were observed across all epochs of training, with most correlation coefficients for the hidden and output loss functions falling between r = 0.2 - 0.6, which was, again, much higher than the correlations observed for shuffled data (Figure 3.5C). Interestingly, the correlations between $L^0$ and $L^1$ were smaller on the first epoch of training (see data in red oval, Figure 3.5C). This suggests that the guarantee of coordination between $L^0$ and $L^1$ only comes into full effect once the network has engaged in some learning.
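To make the hidden layer update concrete, here is a minimal sketch of a delta-rule-like version of equation (3.10), in which the plateau-potential difference serves as the per-neuron credit signal; we omit the derivative factors that the full update in the Methods includes, and all names are illustrative:

    import numpy as np

    def hidden_weight_update(alpha_t, alpha_f, s_input_avg, lr):
        # alpha_t - alpha_f: plateau-potential difference, acting as the
        # error signal for each hidden neuron (cf. equations 3.8-3.10).
        # s_input_avg: time-averaged filtered input spike trains.
        delta = alpha_t - alpha_f
        return lr * np.outer(delta, s_input_avg)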

Figure 3.5: Co-ordinated errors between the output and hidden layers. (A) Illustration of the output loss function ($L^1$) and local hidden loss function ($L^0$). For a given test example shown to the network in a forward phase, the output layer loss is defined as the squared norm of the difference between the target firing rates $\boldsymbol{\phi}^{1*}$ and the average firing rates of the output units during the forward phase. Hidden layer loss is defined similarly, except the target is $\boldsymbol{\phi}^{0*}$ (as defined in the text). (B) Plot of $L^1$ vs. $L^0$ for all of the '2' images after one epoch of training. There is a strong correlation between hidden layer loss and output layer loss (real data, black), as opposed to when output and hidden loss values were randomly paired (shuffled data, gray). (C) Plot of correlation between hidden layer loss and output layer loss across training for each category of images (each dot represents one category). The correlation is significantly higher in the real data than the shuffled data throughout training. Note also that the correlation is much lower on the first epoch of training (red oval), suggesting that the conditions for credit assignment are still developing during the first epoch.

Therefore, we inspected whether the conditions on the synaptic matrices that are assumed in the proof of Theorem A.1 were, in fact, being met. More precisely, the proof assumes that the feedforward and feedback synaptic matrices ($W^1$ and $Y$, respectively) produce forward and backward transformations between the output and hidden layer whose Jacobians are approximate inverses of each other (see Proof of Theorem A.1). Since we begin learning with random matrices, this condition is almost definitely not met at the start of training. But, we found that the network learned to meet this condition. Inspection of $W^1$ and $Y$ showed that during the first epoch the Jacobians of the forward and backward functions became approximate inverses of each other (Figure A.1). Since $Y$ is frozen, this means that during the first few image presentations $W^1$ was being updated to have its Jacobian come closer to the inverse of $Y$'s Jacobian. Put another way, the network was learning to do credit assignment. We have yet to resolve exactly why this happens, though the result is very similar to the findings of [46], where a proof is provided for the linear case. Intuitively, though, the reason is likely the interaction between $W^1$ and $W^0$: as $W^0$ gets updated, the hidden layer learns to group stimuli based on the feedback sent through $Y$. So, for $W^1$ to transform the hidden layer activity into the correct output layer activity, $W^1$ must become more like the inverse of $Y$, which would also make the Jacobian of $W^1$ more like the inverse of $Y$'s Jacobian (due to the inverse function theorem). However, a complete, formal explanation for this phenomenon is still missing, and the issue of weight alignment deserves additional investigation [46]. From a biological perspective, it also suggests that very early development may involve a period of learning how to assign credit appropriately. Altogether, our model demonstrates that deep learning using random feedback weights is a general phenomenon, and one which can be implemented using segregated dendrites to keep forward information separate from feedback signals used for credit assignment.

3.4.4 Deep learning with segregated dendrites

Given our finding that the network was successfully assigning credit for the output error to the hidden layer neurons, we had reason to believe that our network with local weight updates would exhibit deep learning, i.e. an ability to take advantage of a multi-layer structure [117]. To test this, we examined the effects of including hidden layers. If deep learning is indeed operational in the network, then the inclusion of hidden layers should improve the ability of the network to classify images. We built three different versions of the network (Figure 3.6A). The first was a network that had no hidden layer, i.e. the input neurons projected directly to the output neurons. The second was the network illustrated in Figure 3.3B, with a single hidden layer. The third contained two hidden layers, with the output layer projecting directly back to both hidden layers. This direct projection allowed us to build our local targets for each hidden layer using the plateaus driven by the output layer, thereby avoiding a "backward pass" through the entire network as has been used in other models [46, 110, 104].

We trained each network on the 60,000 MNIST training images for 60 epochs, and recorded the percentage of images in the 10,000 image test set that were incorrectly classified. The network with no hidden layers rapidly learned to classify the images, but it also rapidly hit an asymptote at an average error rate of 8.3% (Figure 3.6B, gray line). In contrast, the network with one hidden layer did not exhibit a rapid convergence to an asymptote in its error rate. Instead, it continued to improve throughout all 60 epochs, achieving an average error rate of 4.1% by the 60th epoch (Figure 3.6B, blue line). Similar results were obtained when we loosened the synchrony constraints and instead allowed each hidden layer neuron to engage in plateau potentials at different times (Figure A.2). This demonstrates that strict synchrony in the plateau potentials is not required. But, our target definitions do require two different plateau potentials separated by the teaching signal input, which mandates some temporal control of plateau potentials in the system.

Interestingly, we found that the addition of a second hidden layer further improved learning. The network with two hidden layers learned more rapidly than the network with one hidden layer and achieved an average error rate of 3.2% on the test images by the 60th epoch, also without hitting a clear asymptote in learning (Figure 3.6B, red line). However, it should be noted that additional hidden layers beyond two did not significantly improve the error rate (data not shown), which suggests that our particular algorithm could not be used to construct very deep networks as is. Nonetheless, our network was clearly able to take advantage of multi-layer architectures to improve its learning, which is the key feature of deep learning [117, 118].

Another key feature of deep learning is the ability to generate representations in the higher layers of a network that capture task-relevant information while discarding sensory details [118, 119]. To examine whether our network exhibited this type of abstraction, we used the t-Distributed Stochastic Neighbor Embedding algorithm (t-SNE). The t-SNE algorithm reduces the dimensionality of data while preserving local structure and non-linear manifolds that exist in high-dimensional space, thereby allowing accurate visualization of the structure of high-dimensional data [146].

Figure 3.6: Improvement of learning with hidden layers. (A) Illustration of the three networks used in the simulations. Top: a shallow network with only an input layer and an output layer. Middle: a network with one hidden layer. Bottom: a network with two hidden layers. Both hidden layers receive feedback from the output layer, but through separate synaptic connections with random weights $Y^0$ and $Y^1$. (B) Plot of test error (measured on 10,000 MNIST images not used for training) across 60 epochs of training, for all three networks described in A. The networks with hidden layers exhibit deep learning, because hidden layers decrease the test error. Right: Spreads (min - max) of the results of repeated weight tests (n = 20) after 60 epochs for each of the networks. Percentages indicate means (two-tailed t-test, 1-layer vs. 2-layer: $t_{38} = 197.11$, $P_{38} = 2.5 \times 10^{-58}$; 1-layer vs. 3-layer: $t_{38} = 238.26$, $P_{38} = 1.9 \times 10^{-61}$; 2-layer vs. 3-layer: $t_{38} = 42.99$, $P_{38} = 2.3 \times 10^{-33}$, Bonferroni correction for multiple comparisons). (C) Results of t-SNE dimensionality reduction applied to the activity patterns of the first three layers of a two hidden layer network (after 60 epochs of training). Each data point corresponds to a test image shown to the network. Points are color-coded according to the digit they represent. Moving up through the network, images from identical categories are clustered closer together and separated from images of different categories. Thus the hidden layers learn increasingly abstract representations of digit categories.

We applied t-SNE to the activity patterns at each layer of the two hidden layer network for all of the images in the test set after 60 epochs of training. At the input level, there was already some clustering of images based on their categories. However, the clusters were quite messy, with different categories showing outliers, several clusters, or merged clusters (Figure 3.6C, bottom). For example, the '2' digits in the input layer exhibited two distinct clusters separated by a cluster of '7's: one cluster contained '2's with a loop and one contained '2's without a loop. Similarly, there were two distinct clusters of '4's and '9's that were very close to each other, with one pair for digits on a pronounced slant and one for straight digits (Figure 3.6C bottom, example images). Thus, although there is built-in structure to the categories of the MNIST dataset, there are a number of low-level features that do not respect category boundaries. In contrast, at the first hidden layer, the activity patterns were much cleaner, with far fewer outliers and split/merged clusters (Figure 3.6C, middle). For example, the two separate '2' digit clusters were much closer to each other and were now only separated by a very small cluster of '7's. Likewise, the '9' and '4' clusters were now distinct and no longer split based on the slant of the digit. Interestingly, when we examined the activity patterns at the second hidden layer, the categories were even better segregated, with only a little bit of splitting or merging of category clusters (Figure 3.6C, top).
Therefore, the network had learned to develop representations in the hidden layers wherein the categories were very distinct and low-level features unrelated to the categories were largely ignored. This abstract representation is likely to be key to the improved error rate in the two hidden layer network. Altogether, our data demonstrate that our network with segregated dendritic compartments can engage in deep learning.
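As a rough illustration of the analysis above, the sketch below applies t-SNE to stored layer activity patterns. It assumes scikit-learn and matplotlib, neither of which is named in the Methods (the original code used NumPy/SciPy), and the arrays are random stand-ins for the recorded test-set firing rates:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_layer_tsne(activations, labels, title):
    """Project one layer's activity patterns (n_images x n_neurons) to 2D."""
    embedding = TSNE(n_components=2, perplexity=30).fit_transform(activations)
    plt.figure()
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='tab10', s=2)
    plt.title(title)

# Stand-in data; in the real analysis these would be the recorded
# hidden-layer rates and the digit labels of the test images.
rates_hidden1 = np.random.rand(1000, 500)
digit_labels = np.random.randint(0, 10, 1000)
plot_layer_tsne(rates_hidden1, digit_labels, 'Hidden layer 1')
plt.show()
```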

3.4.5 Coordinated local learning mimics backpropagation of error

The backpropagation of error algorithm [123] is still the primary learning algorithm used for deep supervised learning in artificial neural networks [118]. Previous work has shown that learning with random feedback weights can actually match the synaptic weight updates specified by the backpropagation algorithm after a few epochs of training [46]. This fascinating observation suggests that deep learning with random feedback weights is not completely distinct from backpropagation of error, but rather, networks with random feedback connections learn to approximate credit assignment as it is done in backpropagation [46]. Hence, we were curious as to whether or not our network was, in fact, learning to approximate the synaptic weight updates prescribed by backpropagation.

To test this, we trained our one hidden layer network as before, but now, in addition to calculating the vector of hidden layer synaptic weight updates specified by our local learning rule ($\Delta W^0$ in equation (3.10)), we also calculated the vector of hidden layer synaptic weight updates that would be specified by non-locally backpropagating the error from the output layer ($\Delta W^0_{BP}$). We then calculated the angle between these two alternative weight updates. In a very high-dimensional space, any two independent vectors will be roughly orthogonal to each other (i.e. $\Delta W^0 \angle \Delta W^0_{BP} \approx 90°$). If the two synaptic weight update vectors are not orthogonal to each other (i.e. $\Delta W^0 \angle \Delta W^0_{BP} < 90°$), then it suggests that the two algorithms are specifying similar weight updates.

As in previous work [46], we found that the initial weight updates for our network were orthogonal to the updates specified by backpropagation. But, as the network learned, the angle dropped to approximately 65°, before rising again slightly to roughly 70° (Figure 3.7A, blue line). This suggests that our network was learning to develop local weight updates in the hidden layer that were in rough agreement with the updates that explicit backpropagation would produce. However, this drop in orthogonality was still much less than that observed in non-spiking artificial neural networks learning with random feedback weights, which show a drop to below 45° [46]. We suspected that the higher angle between the weight updates that we observed may have been because we were using spikes to communicate the feedback from the upper layer, which could introduce both noise and bias in the estimates of the output layer activity. To test this, we also examined the weight updates that our algorithm would produce if we propagated the spike rates of the output layer neurons, $\phi^1(t)$, back directly through the random feedback weights, $Y^0$. In this scenario, we observed a much sharper drop in the $\Delta W^0 \angle \Delta W^0_{BP}$ angle, which reduced to roughly 35° before rising again to 40° (Figure 3.7A, red line). These results show that, in principle, our algorithm is learning to approximate the backpropagation algorithm, though with some drop in accuracy introduced by the use of spikes to propagate output layer activities to the hidden layer.
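For concreteness, the angle comparison described above amounts to the following computation; this is a minimal sketch assuming the two update matrices are available as NumPy arrays (the function name is ours, not from the original codebase):

```python
import numpy as np

def update_angle(dW_local, dW_bp):
    """Angle (in degrees) between two weight-update matrices, flattened to vectors."""
    u, v = dW_local.ravel(), dW_bp.ravel()
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Two independent random matrices in a high-dimensional space are ~orthogonal:
print(update_angle(np.random.randn(500, 784), np.random.randn(500, 784)))  # ~90
```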

Figure 3.7: Approximation of backpropagation with local learning rules. (A) Plot of the angle between weight updates prescribed by our local update learning algorithm compared to those prescribed by backpropagation of error, for a one hidden layer network over 10 epochs of training (each point on the horizontal axis corresponds to one image presentation). Data was time-averaged using a sliding window of 100 image presentations. When training the network using the local update learning algorithm, feedback was sent to the hidden layer either using spiking activity from the output layer units (blue) or by directly sending the spike rates of output units (red). The angle between the local update $\Delta W^0$ and backpropagation weight updates $\Delta W^0_{BP}$ remains under 90° during training, indicating that both algorithms point weight updates in a similar direction. (B) Examples of hidden layer receptive fields (synaptic weights) obtained by training the network in A using our local update learning rule (left) and backpropagation of error (right) for 60 epochs. (C) Plot of correlation between local update receptive fields and backpropagation receptive fields. For each of the receptive fields produced by local update, we plot the maximum Pearson correlation coefficient between it and all 500 receptive fields learned using backpropagation (Regular). Overall, the maximum correlation coefficients are greater than those obtained after shuffling all of the values of the local update receptive fields (Shuffled).

To further examine how our local learning algorithm compared to backpropagation, we compared the low-level features that the two algorithms learned. To do this, we trained the one hidden layer network with both our algorithm and backpropagation. We then examined the receptive fields (i.e. the synaptic weights) produced by both algorithms in the hidden layer synapses ($W^0$) after 60 epochs of training. The two algorithms produced qualitatively similar receptive fields (Figure 3.7B). Both produced receptive fields with clear, high-contrast features for detecting particular strokes or shapes. To quantify the similarity, we conducted pair-wise correlation calculations for the receptive fields produced by the two algorithms and identified the maximum correlation pairs for each. Compared to shuffled versions of the receptive fields, there was a very high level of maximum correlation (Figure 3.7C), showing that the receptive fields were indeed quite similar. Thus, the data demonstrate that our learning algorithm using random feedback weights into segregated dendrites can in fact come to approximate the backpropagation of error algorithm.
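A minimal sketch of this receptive-field comparison, assuming the learned weight matrices are available as NumPy arrays (the random matrices below are stand-ins for the actual learned $W^0$ matrices):

```python
import numpy as np

def max_correlations(rf_a, rf_b):
    """For each receptive field (row) of rf_a, return the maximum Pearson
    correlation coefficient over all receptive fields (rows) of rf_b."""
    a = (rf_a - rf_a.mean(1, keepdims=True)) / rf_a.std(1, keepdims=True)
    b = (rf_b - rf_b.mean(1, keepdims=True)) / rf_b.std(1, keepdims=True)
    return (a @ b.T / rf_a.shape[1]).max(axis=1)

rf_local = np.random.randn(500, 784)   # stand-in for local-update weights
rf_bp = np.random.randn(500, 784)      # stand-in for backprop weights
regular = max_correlations(rf_local, rf_bp)
shuffled = max_correlations(
    np.apply_along_axis(np.random.permutation, 1, rf_local), rf_bp)
```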

3.4.6 Conditions on feedback weights

Once we had convinced ourselves that our learning algorithm was, in fact, providing a solution to the credit assignment problem, we wanted to examine some of the constraints on learning. First, we wanted to explore the structure of the feedback weights. In our initial simulations we used non-sparse, random (i.e. normally distributed) feedback weights. We were interested in whether learning could still work with sparse weights, given that neocortical connectivity is sparse. As well, we wondered whether symmetric weights would improve learning, which would be expected given previous findings [46, 110, 104]. To explore these questions, we trained our one hidden layer network using both sparse feedback weights (only 20% non-zero values) and symmetric weights ($Y = (W^1)^T$) (Figure 3.8A,C).

We found that learning actually improved slightly with sparse weights (Figure 3.8B, red line), achieving an average error rate of 3.7% by the 60th epoch, compared to the average 4.1% error rate achieved with fully random weights. But, this result appeared to depend on the magnitude of the sparse weights. To compensate for the loss of 80% of the weights, we initially increased the sparse synaptic weight magnitudes by a factor of 5. However, when we did not re-scale the sparse weights, learning was actually worse (Figure A.3), though this could likely be dealt with by a careful resetting of learning rates. Altogether, our results suggest that sparse feedback provides a signal that is sufficient for credit assignment.

Similar to sparse feedback weights, symmetric feedback weights also improved learning, leading to a rapid decrease in the test error and an error rate of 3.6% by the 60th epoch (Figure 3.8D, red line). This is interesting, given that backpropagation assumes symmetric feedback weights [46, 37], though our proof of Theorem A.1 does not. However, when we added noise to the symmetric weights, any advantage was eliminated and learning was, in fact, slightly impaired (Figure 3.8D, blue line). At first, this was a very surprising result: given that learning works with random feedback weights, why would it not work with symmetric weights with noise? However, when we considered our previous finding that during the first epoch the feedforward weights, $W^1$, learn to have the feedforward Jacobian match the inverse of the feedback Jacobian (Figure A.1), a possible answer emerges. In the case of symmetric feedback weights, the synaptic matrix $Y$ changes as $W^1$ changes. This works fine when $Y$ is set to $(W^1)^T$, since that artificially forces something akin to backpropagation. But, if the feedback weights are set to $(W^1)^T$ plus noise, then the system can never align the Jacobians appropriately, since $Y$ is now a moving target. This would imply that any implementation of feedback learning must either be very effective (to achieve the right feedback) or very slow (to allow the feedforward weights to adapt).
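The three feedback-weight conditions tested here can be constructed as in the sketch below; shapes and scales are illustrative, with the 20% sparsity, factor-of-5 re-scaling, and σ = 0.05 noise taken from Figure 3.8:

```python
import numpy as np

m, n = 500, 10                        # hidden and output layer sizes
W1 = np.random.randn(n, m) * 0.1      # stand-in feedforward weights

Y_random = np.random.randn(m, n)      # dense random feedback (baseline)

mask = np.random.rand(m, n) < 0.2     # keep only 20% of connections
Y_sparse = 5.0 * Y_random * mask      # re-scale by 5 to compensate

Y_symmetric = W1.T                                           # Y = (W^1)^T
Y_sym_noise = W1.T + np.random.normal(0, 0.05, size=(m, n))  # symmetric + noise
```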

Figure 3.8: Conditions on feedback synapses for effective learning. (A) Diagram of a one hidden layer network trained in B, with 80% of feedback weights set to zero. The remaining feedback weights $Y^0$ were multiplied by 5 in order to maintain a similar overall magnitude of feedback signals. (B) Plot of test error across 60 epochs for our standard one hidden layer network (gray) and a network with sparse feedback weights (red). Sparse feedback weights resulted in improved learning performance compared to fully connected feedback weights. Right: Spreads (min – max) of the results of repeated weight tests (n = 20) after 60 epochs for each of the networks. Percentages indicate mean final test errors for each network (two-tailed t-test, regular vs. sparse: $t_{38} = 16.43$, $P_{38} = 7.4 \times 10^{-19}$). (C) Diagram of a one hidden layer network trained in D, with feedback weights that are symmetric to feedforward weights $W^1$, and symmetric but with added noise. Noise added to feedback weights is drawn from a normal distribution with variance σ = 0.05. (D) Plot of test error across 60 epochs of our standard one hidden layer network (gray), a network with symmetric weights (red), and a network with symmetric weights with added noise (blue). Symmetric weights result in improved learning performance compared to random feedback weights, but adding noise to symmetric weights results in impaired learning. Right: Spreads (min – max) of the results of repeated weight tests (n = 20) after 60 epochs for each of the networks. Percentages indicate means (two-tailed t-test, random vs. symmetric: $t_{38} = 18.46$, $P_{38} = 4.3 \times 10^{-20}$; random vs. symmetric with noise: $t_{38} = -71.54$, $P_{38} = 1.2 \times 10^{-41}$; symmetric vs. symmetric with noise: $t_{38} = -80.35$, $P_{38} = 1.5 \times 10^{-43}$, Bonferroni correction for multiple comparisons).

3.4.7 Learning with partial apical attenuation

Another constraint that we wished to examine was whether total segregation of the apical inputs was necessary, given that real pyramidal neurons only show an attenuation of distal apical inputs to the soma [60]. Total segregation ($g_a = 0$) renders the network effectively feed-forward in its dynamics, which made it easier to construct the loss functions to ensure that reducing $L^0$ also reduces $L^1$ (see Figure 3.5 and Theorem A.1). But, we wondered whether some degree of apical conductance to the soma would be sufficiently innocuous so as to not disrupt deep learning. To examine this, we re-ran our two hidden layer network, but now, we allowed the apical dendritic voltage to influence the somatic voltage by setting $g_a = 0.05$. This value gave us twelve times more attenuation than the attenuation from the basal compartments, since $g_b = 0.6$ (Figure 3.9A). When we compared the learning in this scenario to the scenario with total apical segregation, we observed very little difference in the error rates on the test set (Figure 3.9B, gray and red lines). Importantly, though, we found that if we increased the apical conductance to the same level as the basal ($g_a = g_b = 0.6$) then the learning was significantly impaired (Figure 3.9B, blue line). This demonstrates that although total apical attenuation is not necessary, partial segregation of the apical compartment from the soma is necessary. That result makes sense given that our local targets for the hidden layer neurons incorporate a term that is supposed to reflect the response of the output neurons to the feedforward sensory information ($\alpha^f$). Without some sort of separation of feedforward and feedback information, as is assumed in other models of deep learning [46, 110], this feedback signal would get corrupted by recurrent dynamics in the network. Our data show that electrotonically segregated dendrites are one potential way to achieve the separation between feedforward and feedback information that is required for deep learning.

Figure 3.9: Importance of dendritic segregation for deep learning. (A) Left: Diagram of a hidden layer neuron. $g_a$ represents the strength of the coupling between the apical dendrite and soma. Right: Example traces of the apical voltage in a single neuron $V_i^{0a}$ and the somatic voltage $V_i^0$ in response to spikes arriving at apical synapses. Here $g_a = 0.05$, so the apical activity is strongly attenuated at the soma. (B) Plot of test error across 60 epochs of training on MNIST of a two hidden layer network, with total apical segregation (gray), strong apical attenuation (red) and weak apical attenuation (blue). Apical input to the soma did not prevent learning if it was strongly attenuated, but weak apical attenuation impaired deep learning. Right: Spreads (min – max) of the results of repeated weight tests (n = 20) after 60 epochs for each of the networks. Percentages indicate means (two-tailed t-test, total segregation vs. strong attenuation: $t_{38} = -4.00$, $P_{38} = 8.4 \times 10^{-4}$; total segregation vs. weak attenuation: $t_{38} = -95.24$, $P_{38} = 2.4 \times 10^{-46}$; strong attenuation vs. weak attenuation: $t_{38} = -92.51$, $P_{38} = 7.1 \times 10^{-46}$, Bonferroni correction for multiple comparisons).

3.5 Discussion

Deep learning has radically altered the field of AI, demonstrating that parallel distributed processing across multiple layers can produce human/animal-level capabilities in image classification, pattern recognition and reinforcement learning [128, 118, 119, 120, 147, 121]. Deep learning was motivated by analogies to the real brain [118, 148], so it is tantalizing that recent studies have shown that deep neural networks develop representations that strongly resemble the representations observed in the mammalian neocortex [34, 32, 35, 36]. In fact, deep learning models can match cortical representations better than some models that explicitly attempt to mimic the real brain [34]. Hence, at a phenomenological level, it appears that deep learning, defined as multilayer cost function reduction with appropriate credit assignment, may be key to the remarkable computational prowess of the mammalian brain [122]. However, the lack of biologically feasible mechanisms for credit assignment in deep learning algorithms, most notably backpropagation of error [123], has left neuroscientists with a mystery. Given that the brain cannot use backpropagation, how does it solve the credit assignment problem (Figure 3.1)?

Here, we expanded on an idea that previous authors have explored [134, 132, 135] and demonstrated that segregating the feedback and feedforward inputs to neurons, much as the real neocortex does [60, 133, 61], can enable the construction of local targets to assign credit appropriately to hidden layer neurons (Figure 3.2). With this formulation, we showed that we could use segregated dendritic compartments to coordinate learning across layers (Figure 3.3, Figure 3.4 and Figure 3.5). This enabled our network to take advantage of multiple layers to develop representations of hand-written digits in hidden layers that enabled better levels of classification accuracy on the MNIST dataset than could be achieved with a single layer (Figure 3.6). Furthermore, we found that our algorithm actually approximated the weight updates that would be prescribed by backpropagation, and produced similar low-level feature detectors (Figure 3.7). As well, we showed that our basic framework works with sparse feedback connections (Figure 3.8) and more realistic, partial apical attenuation (Figure 3.9). Therefore, our work demonstrates that deep learning is possible in a biologically feasible framework, provided that feedforward and feedback signals are sufficiently segregated in different dendrites.

In this work we adopted a similar strategy to the one taken by Lee et al.'s (2015) difference target propagation algorithm, wherein the feedback from higher layers is used to construct local firing-rate targets at the hidden layers. One of the reasons that we adopted this strategy is that it is appealing to think that feedback from upper layers may not simply be providing a signal for plasticity, but also a predictive and/or modulatory signal to push the hidden layer neurons towards a "better" activity pattern in real-time. This sort of top-down control could be used by the brain to improve sensory processing in different contexts and engage in inference [37]. Indeed, framing cortico-cortical feedback as a mechanism to predict or modulate incoming sensory activity is a more common way of viewing feedback signals in the neocortex [63, 149, 150, 151, 152]. In light of this, it is interesting to note that distal apical inputs in somatosensory cortex can predict upcoming stimuli [152, 151], and help animals perform sensory discrimination tasks [153, 64]. However, in our model, we did not actually implement a system that altered the hidden layer activity to make sensory computations—we simply used the feedback signals to drive learning. In line with this view of top-down feedback, two recent papers have found evidence that cortical feedback can indeed guide feedforward sensory plasticity [154, 155], and in the hippocampus, there is evidence that plateau potentials generated by apical inputs are key determinants of plasticity [136, 137]. But, ultimately, there is no reason that feedback signals cannot provide both top-down prediction/modulation and a signal for learning [132]. In this respect, a potential future advance on our model would be to implement a system wherein the feedback makes predictions and "nudges" the hidden layers towards appropriate activity patterns in order to guide learning and shape perception simultaneously. This proposal is reminiscent of the approach taken in previous computational models [111, 135, 134]. Future research could study how top-down control of activity and a signal for credit assignment can be combined.

In a number of ways, the model that we presented here is more biologically feasible than other deep learning models. We utilized leaky integrator neurons that communicate with spikes, we simulated in near continuous-time, and we used spatially local synaptic plasticity rules. Yet, there are still clearly unresolved issues of biological feasibility in our model. Most notably, the model updates synaptic weights using the difference between two plateau potentials that occur following two different phases. There are three issues with this method from a biological standpoint. First, it necessitates two distinct global phases of processing (the "forward" and "target" phases). Second, the plateau potentials occur in the apical compartment, but they are used to update the basal synapses, meaning that this information from the apical dendrites must somehow be communicated to the rest of the neuron. Third, the two plateau potentials occur with a temporal gap of tens of milliseconds, meaning that this difference must somehow be computed over time.

These issues could, theoretically, be resolved in a biologically realistic manner. The two different phases could be a result of a global signal indicating whether the teaching signal was present. This could be accomplished with neuromodulatory systems [141], or alternatively, with oscillations that the teaching signal and apical dendrites are phase locked to [156]. Communicating plateau potentials to the basal dendrites is also possible using known biological principles. Plateau potentials induce bursts of action potentials in pyramidal neurons [60], and the rate-of-fire of the bursts would be a function of the level of the plateau potential. Given that action potentials would propagate back through the basal dendrites [157], any cellular mechanism in the basal dendrites that is sensitive to rate-of-fire of bursts could be used to detect the level of the plateau potentials in the apical dendrite. Finally, taking the difference between two events that occur tens of milliseconds apart is possible if such a hypothetical cellular signal that is sensitive to bursts had a slow decay time constant, and reacted differently depending on whether the global phase signal was active. A simple mathematical formulation for such a cellular signal is given in the Methods (see equations (3.36) and (3.37)). It is worth noting that incorporation of bursting into somatic dynamics would be unlikely to affect the learning results we presented here. This is because we calculate weight updates by averaging the activity of the neurons for a period after the network is near steady-state (i.e. the period marked with the blue line in Figure 3.3C, see also equation (3.5)). Even if bursts of activity temporarily altered the dynamics of the network, they would not significantly alter the steady-state activity. Future work could expand on the model presented here and explore whether bursting activity might beneficially alter somatic dynamics (e.g. for on-line inference), as well as driving learning.

These possible implementations are clearly speculative, and only partially in line with experimental evidence. As the adage goes, all models are wrong, but some models are useful. Our model aims to inspire new ways to think about how the credit assignment problem could be solved by known circuits in the brain. Our study demonstrates that some of the machinery that is known to exist in the neocortex, namely electrotonically segregated apical dendrites receiving top-down inputs, may be well-suited to credit assignment computations. What we are proposing is that the neocortex could use the segregation of top-down inputs to the apical dendrites in order to solve the credit assignment problem, without using a separate feedback pathway as is implicit in most deep learning models used in machine learning. We consider this to be the core insight of our model, and an important step in making deep learning more biologically plausible. Indeed, our model makes both a generic, and a specific, prediction about the role of synaptic inputs to apical dendrites during learning. The generic prediction is that the sign of synaptic plasticity, i.e. whether LTP or LTD occur, in the basal dendrites will be modulated by different patterns of inputs to the apical dendrites. The more specific prediction that our model makes is that the timing of apical inputs relative to basal inputs should be what determines the sign of plasticity for synapses in the basal dendrites. For example, if apical and basal inputs arrive at the same time, but the apical inputs disappear before the basal inputs do, then presumably plateau potentials will be stronger early in the stimulus presentation (i.e. $\alpha^f > \alpha^t$), and so the basal synapses should engage in LTD. In contrast, if the apical inputs only arrive after the basal inputs have been active for some period of time, then plateau potentials will be stronger towards the end of stimulus presentation (i.e. $\alpha^f < \alpha^t$), and so the basal synapses should engage in LTP. Both the generic and specific predictions should be experimentally testable using modern optical techniques to separate the inputs to the basal and apical dendrites (Figure 3.10).

Figure 3.10: An experiment to test the central prediction of the model. (A) Illustration of the basic experimental set-up required to test the predictions (generic or specific) of the deep learning with segregated dendrites model. To test the predictions of the model, patch clamp recordings could be performed in neocortical pyramidal neurons (e.g. layer 5 neurons, shown in black), while the top-down inputs to the apical dendrites and bottom-up inputs to the basal dendrites are controlled separately. This could be accomplished optically, e.g. by infecting layer 4 cells with channelrhodopsin (blue cell), and a higher-order cortical region with a red-shifted opsin (red axon projections), such that the two inputs could be controlled by different colors of light. (B) Illustration of the specific experimental prediction of the model. With separate control of top-down and bottom-up inputs, a synaptic plasticity experiment could be conducted to test the central prediction of the model, i.e. that the timing of apical inputs relative to basal inputs should determine the sign of plasticity at basal dendrites. After recording baseline postsynaptic responses (black lines) to the basal inputs (blue lines), a plasticity induction protocol could either have the apical inputs (red lines) arrive early during basal inputs (left) or late during basal inputs (right). The prediction of our model would be that the former would induce LTD in the basal synapses, while the latter would induce LTP.

Another direction for future research should be to consider how to use the machinery of neocortical microcircuits to communicate credit assignment signals without relying on differences across phases, as we did here. For example, somatostatin positive interneurons, which possess short-term facilitating synapses [158], are particularly sensitive to bursts of spikes, and could be part of a mechanism to calculate differences in the top-down signals being received by pyramidal neuron dendrites. If a calculation of this difference spanned the time before and after a teaching signal arrived, it could, theoretically, provide the computation that our system implements with a difference between plateau potentials. Indeed, we would argue that credit assignment may be one of the major functions of the canonical neocortical microcircuit motif. If this is correct, then the inhibitory interneurons that target apical dendrites may be used by the neocortex to control learning [139]. Although this is speculative, it is worth noting that current evidence supports the idea that neuromodulatory inputs carrying temporally precise salience information [143] can shut off interneurons to disinhibit the distal apical dendrites [141, 49, 142, 144], and presumably, promote apical communication to the soma. Recent work suggests that the specific patterns of interneuron inhibition on the apical dendrites are spatially precise and differentially timed to motor behaviours [159], which suggests that there may well be coordinated physiological mechanisms for determining when and how cortico-cortical feedback is transmitted to the soma and basal dendrites. Future research should examine whether these inhibitory and neuromodulatory mechanisms do, in fact, control plasticity in the basal dendrites of pyramidal neurons, as our model, and some recent experimental work [136, 137], would predict.

A non-biological issue that should be recognized is that the error rates which our network achieved were by no means as low as can be achieved with artificial neural networks, nor at human levels of performance [160, 161]. As well, our algorithm was not able to take advantage of very deep structures (beyond two hidden layers, the error rate did not improve). In contrast, increasing the depth of networks trained with backpropagation can lead to performance improvements [161]. But, these observations do not mean that our network was not engaged in deep learning. First, it is interesting to note that although the backpropagation algorithm is several decades old [123], it was long considered to be useless for training networks with more than one or two hidden layers [117]. Indeed, it was only the use of layer-by-layer training that initially led to the realization that deeper networks can achieve excellent performance [128]. Since then, both the use of very large datasets (with millions of examples), and additional modifications to the backpropagation algorithm, have been key to making backpropagation work well on deeper networks [162, 118]. Future studies could examine how our algorithm could incorporate current techniques used in machine learning to work better on deeper architectures. Second, we stress that our network was not designed to match the state-of-the-art in machine learning, nor human capabilities. To test our basic hypothesis (and to run our leaky-integration and spiking simulations in a reasonable amount of time) we kept the network small, we stopped training before it reached its asymptote, and we did not implement any add-ons to the learning to improve the error rates, such as convolution and pooling layers, initialization tricks, mini-batch training, drop-out, momentum or RMSProp [162, 163, 164]. Indeed, it would be quite surprising if a relatively vanilla, small network like ours could come close to matching current performance benchmarks in machine learning. Third, although our network was able to take advantage of multiple layers to improve the error rate, there may be a variety of reasons that ever-increasing depth did not improve performance significantly. For example, our use of direct connections from the output layer to the hidden layers may have impaired the network's ability to coordinate synaptic updates between hidden layers. As well, given our finding that the use of spikes produced weight updates that were less well-aligned to backpropagation (Figure 3.7A), it is possible that deeper architectures require mechanisms to overcome the inherent noisiness of spikes.

One aspect of our model that we did not develop was the potential for learning at the feedback synapses. Although we used random synaptic weights for feedback, we also demonstrated that our model actually learns to meet the mathematical conditions required for credit assignment (Figure A.1). This suggests that it would be beneficial to develop a synaptic weight update rule for the feedback synapses that made this aspect of the learning better. Indeed, [110] implemented an "inverse loss function" for their feedback synapses which promoted the development of feedforward and feedback functions that were roughly inverses of each other, leading to the emergence of auto-encoder functions in their network. In light of this, it is interesting to note that there is evidence for unique, "reverse" spike-timing-dependent synaptic plasticity rules in the distal apical dendrites of pyramidal neurons [165, 17], which have been shown to produce symmetric feedback weights and auto-encoder functions in artificial spiking networks [166, 167]. Thus, it is possible that early in development the neocortex actually learns cortico-cortical feedback connections that help it to assign credit for later learning. Our work suggests that any experimental evidence showing that feedback connections learn to approximate the inverse of feedforward connections could be considered as evidence for deep learning in the neocortex.

A final consideration, which is related to learning at feedback synapses, is the likely importance of unsupervised learning for the real brain, i.e. learning without a teaching signal. In this paper, we focused on a supervised learning task with a teaching signal. Supervised learning certainly could occur in the brain, especially for goal-directed sensorimotor tasks where animals have access to examples that they could use to generate internal teaching signals [168]. But, unsupervised learning is likely critical for understanding the development of cognition [122]. Importantly, unsupervised learning in multilayer networks still requires a solution to the credit assignment problem [37], so our work here is not completely inapplicable. Nonetheless, future research should examine how the credit assignment problem can be addressed in the specific case of unsupervised learning.

In summary, deep learning has had a huge impact on AI, but, to date, its impact on neuroscience has been limited. Nonetheless, given a number of findings in neurophysiology and modeling [32], there is growing interest in understanding how deep learning may actually be achieved by the real brain [122]. Our results show that by moving away from point neurons, and shifting towards multi-compartment neurons that segregate feedforward and feedback signals, the credit assignment problem can be solved and deep learning can be achieved. Perhaps the dendritic anatomy of neocortical pyramidal neurons is important for nature's own deep learning algorithm.

3.6 Methods

Code for the model can be obtained from a GitHub repository (https://github.com/jordan-g/Segregated-Dendrite-Deep-Learning) [169]. For notational simplicity, we describe our model in the case of a network with only one hidden layer; we describe how this is extended to a network with multiple layers at the end of this section. Table A.1, also at the end of this section, lists the parameter values we used for all of the simulations presented in this paper.

3.6.1 Neuronal dynamics

The network described here consists of an input layer with $\ell$ neurons, a hidden layer with $m$ neurons, and an output layer with $n$ neurons. Neurons in the input layer are simple Poisson spiking neurons whose rate-of-fire is determined by the intensity of image pixels (ranging from 0 to $\phi_{\max}$). Neurons in the hidden layer are modeled using three functional compartments: basal dendrites with voltages $V^{0b}(t) = [V_1^{0b}(t), V_2^{0b}(t), ..., V_m^{0b}(t)]$, apical dendrites with voltages $V^{0a}(t) = [V_1^{0a}(t), V_2^{0a}(t), ..., V_m^{0a}(t)]$, and somata with voltages $V^0(t) = [V_1^0(t), V_2^0(t), ..., V_m^0(t)]$. Feedforward inputs from the input layer and feedback inputs from the output layer arrive at basal and apical synapses, respectively. At basal synapses, presynaptic spikes from input layer neurons are translated into filtered spike trains $s^{\mathrm{input}}(t) = [s_1^{\mathrm{input}}(t), s_2^{\mathrm{input}}(t), ..., s_\ell^{\mathrm{input}}(t)]$ given by:

$$s_j^{\mathrm{input}}(t) = \sum_k \kappa(t - t_{jk}^{\mathrm{input}}) \quad (3.11)$$

where $t_{jk}^{\mathrm{input}}$ is the $k$th spike time of input neuron $j$, and $\kappa(t)$ is the response kernel given by:

$$\kappa(t) = (e^{-t/\tau_L} - e^{-t/\tau_s})\,\Theta(t)/(\tau_L - \tau_s) \quad (3.12)$$

where $\tau_s$ and $\tau_L$ are short and long time constants, and $\Theta$ is the Heaviside step function. Since the network is fully-connected, each neuron in the hidden layer will receive the same set of filtered spike trains from input layer neurons. The filtered spike trains at apical synapses, $s^1(t) = [s_1^1(t), s_2^1(t), ..., s_n^1(t)]$, are modeled in the same manner. The basal and apical dendritic potentials for neuron $i$ are then given by weighted sums of the filtered spike trains at either its basal or apical synapses:

$$V_i^{0b}(t) = \sum_{j=1}^{\ell} W_{ij}^0 s_j^{\mathrm{input}}(t) + b_i^0$$
$$V_i^{0a}(t) = \sum_{j=1}^{n} Y_{ij} s_j^1(t) \quad (3.13)$$

where $b^0 = [b_1^0, b_2^0, ..., b_m^0]$ are bias terms, $W^0$ is the $m \times \ell$ matrix of feedforward weights for neurons in the hidden layer, and $Y$ is the $m \times n$ matrix of their feedback weights. The somatic voltage for neuron $i$ evolves with leak as:

$$\tau \frac{dV_i^0(t)}{dt} = (V^R - V_i^0(t)) + \frac{g_b}{g_l}\left(V_i^{0b}(t) - V_i^0(t)\right) + \frac{g_a}{g_l}\left(V_i^{0a}(t) - V_i^0(t)\right) \quad (3.14)$$
$$= (V^R - V_i^0(t)) + \frac{g_b}{g_l}\left(\sum_{j=1}^{\ell} W_{ij}^0 s_j^{\mathrm{input}}(t) + b_i^0 - V_i^0(t)\right) + \frac{g_a}{g_l}\left(\sum_{j=1}^{n} Y_{ij} s_j^1(t) - V_i^0(t)\right) \quad (3.15)$$

where $V^R$ is the resting potential, $g_l$ is the leak conductance, $g_b$ is the conductance from the basal dendrite to the soma, $g_a$ is the conductance from the apical dendrite to the soma, and $\tau$ is a function of $g_l$ and the membrane capacitance $C_m$:

$$\tau = \frac{C_m}{g_l} \quad (3.16)$$

Note that for simplicity's sake we are assuming a resting potential of 0 mV and a membrane capacitance of 1 F, but these values are not important for the results. Equations (3.13) and (3.14) are identical to equation (3.1) in Results. The instantaneous firing rates of neurons in the hidden layer are given by $\phi^0(t) = [\phi_1^0(t), \phi_2^0(t), ..., \phi_m^0(t)]$, where $\phi_i^0(t)$ is the result of applying a nonlinearity, $\sigma(\cdot)$, to the somatic potential $V_i^0(t)$. We chose $\sigma(\cdot)$ to be a simple sigmoidal function, such that:

$$\phi_i^0(t) = \phi_{\max}\,\sigma(V_i^0(t)) = \phi_{\max}\,\frac{1}{1 + e^{-V_i^0(t)}} \quad (3.17)$$

Here, $\phi_{\max}$ is the maximum possible rate-of-fire for the neurons, which we set to 200 Hz. Note that equation (3.17) is identical to equation (3.3) in Results. Spikes are then generated using Poisson processes with these firing rates. We note that although the maximum rate was 200 Hz, the neurons rarely achieved anything close to this rate, and the average rate of fire in the neurons during our simulations was 24 Hz.

Units in the output layer are modeled using only two compartments, dendrites with voltages $V^{1b}(t) = [V_1^{1b}(t), V_2^{1b}(t), ..., V_n^{1b}(t)]$ and somata with voltages $V^1(t) = [V_1^1(t), V_2^1(t), ..., V_n^1(t)]$. $V_i^{1b}(t)$ is given by:

$$V_i^{1b}(t) = \sum_{j=1}^{m} W_{ij}^1 s_j^0(t) + b_i^1 \quad (3.18)$$

where $s^0(t) = [s_1^0(t), s_2^0(t), ..., s_m^0(t)]$ are the filtered presynaptic spike trains at synapses that receive feedforward input from the hidden layer, and are calculated in the manner described by equation (3.11). $V_i^1(t)$ evolves as:

$$\tau \frac{dV_i^1(t)}{dt} = (V^R - V_i^1(t)) + \frac{g_d}{g_l}\left(V_i^{1b}(t) - V_i^1(t)\right) + I_i(t) \quad (3.19)$$

where $g_l$ is the leak conductance, $g_d$ is the conductance from the dendrite to the soma, and $I(t) = [I_1(t), I_2(t), ..., I_n(t)]$ are somatic currents that can drive output neurons toward a desired somatic voltage. For neuron $i$, $I_i$ is given by:

$$I_i(t) = g_{E_i}(t)(E_E - V_i^1(t)) + g_{I_i}(t)(E_I - V_i^1(t)) \quad (3.20)$$

where $g_E(t) = [g_{E_1}(t), g_{E_2}(t), ..., g_{E_n}(t)]$ and $g_I(t) = [g_{I_1}(t), g_{I_2}(t), ..., g_{I_n}(t)]$ are time-varying excitatory and inhibitory nudging conductances, and $E_E$ and $E_I$ are the excitatory and inhibitory reversal potentials. In our simulations, we set $E_E = 8$ V and $E_I = -8$ V. During the target phase only, we set $g_{I_i} = 1$ and $g_{E_i} = 0$ for all units $i$ whose output should be minimal, and $g_{E_i} = 1$ and $g_{I_i} = 0$ for the unit whose output should be maximal. In this way, all units other than the "target" unit are silenced, while the "target" unit receives a strong excitatory drive. In the forward phase, $I(t)$ is set to 0. The Poisson spike rates $\phi^1(t) = [\phi_1^1(t), \phi_2^1(t), ..., \phi_n^1(t)]$ are calculated as in equation (3.17).
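To make the preceding dynamics concrete, here is a minimal Euler-integration sketch of equations (3.11)-(3.20). The parameter values are placeholders standing in for those in Table A.1, and the function names are ours:

```python
import numpy as np

dt, g_l, g_b, g_a = 1.0, 0.1, 0.6, 0.0   # ms; conductances (g_a = 0: total segregation)
V_R, E_E, E_I = 0.0, 8.0, -8.0
phi_max = 0.2                            # 200 Hz expressed in spikes/ms
tau_s, tau_L = 3.0, 10.0                 # kernel time constants (ms)
tau = 1.0 / g_l                          # C_m = 1 (eq. 3.16)

def kappa(t):
    """Response kernel of equation (3.12); zero for t < 0."""
    return (np.exp(-t / tau_L) - np.exp(-t / tau_s)) * (t >= 0) / (tau_L - tau_s)

def step_hidden(V0, s_input, s1, W0, b0, Y):
    """One Euler step of hidden-layer dynamics (eqs. 3.13, 3.14 and 3.17)."""
    V0b = W0 @ s_input + b0                  # basal dendritic potential
    V0a = Y @ s1                             # apical dendritic potential
    dV = (V_R - V0) + (g_b / g_l) * (V0b - V0) + (g_a / g_l) * (V0a - V0)
    V0 = V0 + dt * dV / tau
    rates = phi_max / (1.0 + np.exp(-V0))    # sigmoid rate function (eq. 3.17)
    spikes = np.random.poisson(rates * dt)   # Poisson spike generation
    return V0, V0a, rates, spikes

def nudging_current(V1, target_index, target_phase):
    """Somatic teaching current for output units (eq. 3.20)."""
    g_E, g_I = np.zeros(V1.shape[0]), np.zeros(V1.shape[0])
    if target_phase:
        g_I[:] = 1.0                                      # silence non-target units
        g_E[target_index], g_I[target_index] = 1.0, 0.0   # drive the target unit
    return g_E * (E_E - V1) + g_I * (E_I - V1)

# Example: one step of an isolated hidden layer receiving a filtered input.
t_axis = np.arange(0.0, 50.0, dt)
s_demo = kappa(t_axis - 5.0) + kappa(t_axis - 20.0)   # spikes at 5 and 20 ms (eq. 3.11)
```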

3.6.2 Plateau potentials

At the end of the forward and target phases, we calculate plateau potentials $\alpha^f = [\alpha_1^f, \alpha_2^f, ..., \alpha_m^f]$ and $\alpha^t = [\alpha_1^t, \alpha_2^t, ..., \alpha_m^t]$ for apical dendrites of hidden layer neurons, where $\alpha_i^f$ and $\alpha_i^t$ are given by:

$$\alpha_i^f = \sigma\left(\frac{1}{t_1^\alpha}\int_{t_0 + \Delta t_s}^{t_1} V_i^{0a}(t)\,dt\right)$$
$$\alpha_i^t = \sigma\left(\frac{1}{t_2^\alpha}\int_{t_1 + \Delta t_s}^{t_2} V_i^{0a}(t)\,dt\right) \quad (3.21)$$

where $t_1$ and $t_2$ are the end times of the forward and target phases, respectively, $\Delta t_s = 30$ ms is the settling time for the voltages, and $t_1^\alpha$ and $t_2^\alpha$ are given by:

$$t_1^\alpha = t_1 - (t_0 + \Delta t_s)$$
$$t_2^\alpha = t_2 - (t_1 + \Delta t_s) \quad (3.22)$$

Note that equation (3.21) is identical to equation (3.5) in Results. These plateau potentials are used by hidden layer neurons to update their basal weights.
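A minimal sketch of the plateau potential calculation in equation (3.21), assuming the apical voltages have been recorded at a 1 ms resolution (the array layout is an assumption of this sketch):

```python
import numpy as np

dt, delta_ts = 1.0, 30.0    # ms; settling time before averaging begins

def plateau(V0a_trace, phase_start, phase_end):
    """V0a_trace: (time_steps, m) apical voltages. Returns alpha for the phase:
    a sigmoid of the time-averaged apical voltage after settling (eq. 3.21)."""
    i0 = int((phase_start + delta_ts) / dt)
    i1 = int(phase_end / dt)
    mean_voltage = V0a_trace[i0:i1].mean(axis=0)
    return 1.0 / (1.0 + np.exp(-mean_voltage))
```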

3.6.3 Weight updates

All feedforward synaptic weights are updated at the end of each target phase. Output layer units update their synaptic weights $W^1$ in order to minimize the loss function

$$L^1 = \|\phi^{1*} - \phi_{\max}\,\sigma(\bar{V}^{1,f})\|_2^2 \quad (3.23)$$

where $\phi^{1*} = \bar{\phi}^{1,t}$, as in equation (3.6); here and below, a bar with a superscript $f$ or $t$ denotes a time average taken over the forward or target phase, respectively. Note that, as long as neuronal units calculate averages after the network has reached a steady state, and the firing-rates of the neurons are in the linear region of the sigmoid function, then for layer $x$,

$$\phi_{\max}\,\sigma(\bar{V}^{x,f}) \approx \overline{\phi_{\max}\,\sigma(V^x)}^{\,f} = \bar{\phi}^{x,f} \quad (3.24)$$

Thus,

$$L^1 \approx \|\bar{\phi}^{1,t} - \bar{\phi}^{1,f}\|_2^2 \quad (3.25)$$

as in equation (3.7).

All average voltages are calculated after a delay $\Delta t_s$ from the start of a phase, which allows the network to reach a steady state before averaging begins. In practice this means that the average somatic voltage for output layer neuron $i$ in the forward phase, $\bar{V}_i^{1,f}$, has the property

$$\bar{V}_i^{1,f} \approx k_d \bar{V}_i^{1b,f} = k_d\left(\sum_{j=1}^{m} W_{ij}^1 \bar{s}_j^{0,f} + b_i^1\right) \quad (3.26)$$

where $k_d$ is given by:

$$k_d = \frac{g_d}{g_l + g_d} \quad (3.27)$$

Thus,

$$\frac{\partial L^1}{\partial W^1} \approx -k_d\,\phi_{\max}\left(\phi^{1*} - \phi_{\max}\,\sigma(\bar{V}^{1,f})\right)\sigma'(\bar{V}^{1,f}) \circ \bar{s}^{0,f}$$
$$\frac{\partial L^1}{\partial b^1} \approx -k_d\,\phi_{\max}\left(\phi^{1*} - \phi_{\max}\,\sigma(\bar{V}^{1,f})\right)\sigma'(\bar{V}^{1,f}) \quad (3.28)$$

Note that these partial derivatives assume that the activity during the target phase is fixed. We do this because the goal of learning is to have the network behave as it does during the target phase, even when the teaching signal is absent. Thus, we do not update synapses in order to alter the target phase activity. As a result, there are no terms in the equation related to the partial derivatives of the voltages or firing-rates during the target phase. The dendrites in the output layer use this approximation of the gradient in order to update their weights using gradient descent:

$$W^1 \rightarrow W^1 - \eta^1 P^1 \frac{\partial L^1}{\partial W^1}$$
$$b^1 \rightarrow b^1 - \eta^1 P^1 \frac{\partial L^1}{\partial b^1} \quad (3.29)$$

where $\eta^1$ is a learning rate constant, and $P^1$ is a scaling factor used to normalize the scale of the rate-of-fire function.
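Putting equations (3.28) and (3.29) together, the output layer update can be sketched as below. Overbarred quantities from the text appear as plain arrays of phase-averaged values, and $\sigma'(V) = \sigma(V)(1 - \sigma(V))$ for the logistic nonlinearity:

```python
import numpy as np

def output_update(W1, b1, phi1_star, V1_avg_f, s0_avg_f, k_d, phi_max, eta1, P1):
    """Gradient-descent update of the output layer weights (eqs. 3.28-3.29)."""
    sig = 1.0 / (1.0 + np.exp(-V1_avg_f))          # sigma of forward-phase average
    grad = -k_d * phi_max * (phi1_star - phi_max * sig) * sig * (1.0 - sig)
    dW1 = np.outer(grad, s0_avg_f)                 # dL1/dW1
    return W1 - eta1 * P1 * dW1, b1 - eta1 * P1 * grad
```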

In the hidden layer, basal dendrites update their synaptic weights $W^0$ by minimizing the loss function

$$L^0 = \|\phi^{0*} - \phi_{\max}\,\sigma(\bar{V}^{0,f})\|_2^2 \quad (3.30)$$

We define the target rates-of-fire $\phi^{0*} = [\phi_1^{0*}, \phi_2^{0*}, ..., \phi_m^{0*}]$ such that

$$\phi_i^{0*} = \bar{\phi}_i^{0,f} + \alpha_i^t - \alpha_i^f \quad (3.31)$$

f f f f t t t t where α = [α1 , α2 , ..., αm] and α = [α1, α2, ..., αm] are forward and target phase plateau potentials given in equation (3.21). Note that equation (3.31) is identical to equation (3.8) in Results. These hidden layer target firing rates are similar to the targets used in difference target propagation [110]. Using equation (3.24), we can show that

$$L^0 \approx \|\alpha^t - \alpha^f\|_2^2 \quad (3.32)$$

as in equation (3.9). Hence:

$$\frac{\partial L^0}{\partial W^0} \approx -k_b\left(\alpha^t - \alpha^f\right)\phi_{\max}\,\sigma'(\bar{V}^{0,f}) \circ \bar{s}^{\mathrm{input},f}$$
$$\frac{\partial L^0}{\partial b^0} \approx -k_b\left(\alpha^t - \alpha^f\right)\phi_{\max}\,\sigma'(\bar{V}^{0,f}) \quad (3.33)$$

where $k_b$ is given by:

$$k_b = \frac{g_b}{g_l + g_b + g_a} \quad (3.34)$$

Note that although $\phi^{0*}$ is a function of $W^0$ and $b^0$, we do not differentiate this term with respect to the weights and biases. Instead, we treat $\phi^{0*}$ as a fixed state for the hidden layer neurons to learn to reproduce. Basal weights are updated in order to descend this approximation of the gradient:

$$W^0 \rightarrow W^0 - \eta^0 P^0 \frac{\partial L^0}{\partial W^0}$$
$$b^0 \rightarrow b^0 - \eta^0 P^0 \frac{\partial L^0}{\partial b^0} \quad (3.35)$$
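Analogously, equations (3.33) and (3.35) give the sketch below for the hidden layer; note that the only feedback-dependent quantity is the plateau difference $(\alpha^t - \alpha^f)$, so the rule is spatially local:

```python
import numpy as np

def hidden_update(W0, b0, alpha_t, alpha_f, V0_avg_f, s_input_avg_f,
                  k_b, phi_max, eta0, P0):
    """Local gradient-descent update of the hidden layer weights (eqs. 3.33, 3.35)."""
    sig = 1.0 / (1.0 + np.exp(-V0_avg_f))
    grad = -k_b * (alpha_t - alpha_f) * phi_max * sig * (1.0 - sig)
    dW0 = np.outer(grad, s_input_avg_f)                # dL0/dW0
    return W0 - eta0 * P0 * dW0, b0 - eta0 * P0 * grad
```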

Again, we assume that the activity during the target phase is fixed, so no derivatives are taken with respect to voltages or firing-rates during the target phase. Importantly, this update rule is spatially local for the hidden layer neurons. It consists essentially of three terms: (1) the difference in the plateau potentials for the target and forward phases ($\alpha^t - \alpha^f$), (2) the derivative of the spike rate function ($\phi_{\max}\,\sigma'(\bar{V}^{0,f})$), and (3) the filtered presynaptic spike trains ($\bar{s}^{\mathrm{input},f}$). All three of these terms are values that a real neuron could theoretically calculate using some combination of molecular synaptic tags, calcium currents, and back-propagating action potentials.

One aspect of this update rule that is biologically questionable, though, is the use of the term ($\alpha^t - \alpha^f$). This requires a difference between plateau potentials that are separated by tens of milliseconds. How could such a signal be used by basal dendrite synapses to guide their updates? Plateau potentials can drive bursts of spikes [60], which can propagate to basal dendrites [157]. Since the plateau potentials are similar to rate variables (i.e. a sigmoid applied to the voltage), the number of spikes during the bursts, $N^f = [N_1^f, N_2^f, ..., N_m^f]$ and $N^t = [N_1^t, N_2^t, ..., N_m^t]$, for the forward and target plateaus, respectively, could be sampled from a Poisson distribution with rate parameter equal to the plateau potential level:

$$N^f \sim \mathrm{Poisson}(\alpha^f)$$
$$N^t \sim \mathrm{Poisson}(\alpha^t) \quad (3.36)$$

If the distinct phases (forward and target) were marked by some global signal, $\phi(t)$, that was communicated to all of the neurons, e.g. a neuromodulatory signal, the phase of a global oscillation, or some blanket inhibition signal, then we can imagine an internal cellular memory mechanism in the basal dendrites of the $i$th neuron, $M_i$ (e.g. a molecular signal like the activity of an enzyme, the phosphorylation level of some protein, or the amount of calcium released from intracellular stores), which could be differentially sensitive to the inter-spike interval of bursts, depending on $\phi$. So, for example, if we define:

$$\phi(t) = \begin{cases} -1, & \text{if in the forward phase, i.e. } x = f \\ 1, & \text{if in the target phase, i.e. } x = t \end{cases}$$
$$\frac{dM_i(t)}{dt} \propto \phi(t)\,N_i^x \quad (3.37)$$

where $x$ indicates the forward or target phase. Then, the change in $M_i$ from before the bursts occur to afterwards would be, on average, proportional to the difference ($\alpha^t - \alpha^f$), and could be used to calculate the weight updates. However, this is highly speculative, and it is not clear that such a mechanism would be present in real neurons. We have outlined the mathematics here to make the reality of implementing the current model explicit, but we would predict that the brain would have some alternative method for calculating differences between top-down inputs at different times, e.g. by using somatostatin positive interneurons that are preferentially sensitive to bursts and which target the apical dendrite [158]. We are ultimately agnostic as to this mechanism, and so, it was not included in the current model.
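A small numerical sketch of this speculative mechanism (eqs. (3.36) and (3.37)): burst spike counts are drawn from Poisson distributions with rates set by the plateau levels, and the cellular signal accumulates them with a sign given by the phase signal. Averaged over many trials, the accumulated change is proportional to $(\alpha^t - \alpha^f)$:

```python
import numpy as np

def delta_M(alpha_f, alpha_t):
    """Change in the hypothetical cellular signal M across the two phases."""
    N_f = np.random.poisson(alpha_f)   # burst spikes after the forward plateau
    N_t = np.random.poisson(alpha_t)   # burst spikes after the target plateau
    return -N_f + N_t                  # phi = -1, then phi = +1 (eq. 3.37)

alpha_f, alpha_t = np.full(500, 0.3), np.full(500, 0.7)
print(np.mean([delta_M(alpha_f, alpha_t) for _ in range(1000)]))  # ~0.4
```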

3.6.4 Multiple hidden layers

In order to extend our algorithm to deeper networks with multiple hidden layers, our model incorporates direct synaptic connections from the output layer to each hidden layer. Thus, each hidden layer receives feedback from the output layer through its own separate set of fixed, random weights. For example, in a network with two hidden layers, both layers receive the feedback from the output layer at their apical dendrites through backward weights $Y^0$ and $Y^1$. The local targets at each layer are then given by:

$$\phi^{2*} = \bar{\phi}^{2,t} \quad (3.38)$$
$$\phi^{1*} = \bar{\phi}^{1,t} + \alpha^{1,t} - \alpha^{1,f} \quad (3.39)$$
$$\phi^{0*} = \bar{\phi}^{0,t} + \alpha^{0,t} - \alpha^{0,f} \quad (3.40)$$

where the superscripts 0 and 1 denote the first and second hidden layers, respectively, and the superscript 2 denotes the output layer. The local loss functions at each layer are:

$$L^2 = \|\phi^{2*} - \phi_{\max}\,\sigma(\bar{V}^{2,f})\|_2^2$$
$$L^1 = \|\phi^{1*} - \phi_{\max}\,\sigma(\bar{V}^{1,f})\|_2^2 \quad (3.41)$$
$$L^0 = \|\phi^{0*} - \phi_{\max}\,\sigma(\bar{V}^{0,f})\|_2^2$$

where $L^2$ is the loss at the output layer. The learning rules used by the hidden layers in this scenario are the same as in the case with one hidden layer.
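A minimal sketch of these local targets for a two hidden layer network (eqs. (3.38)-(3.40)), with the phase-averaged rates and plateau potentials passed in as NumPy arrays:

```python
import numpy as np  # inputs are arrays of phase-averaged rates and plateaus

def local_targets(phi2_t, phi1_t, phi0_t, alpha1_t, alpha1_f, alpha0_t, alpha0_f):
    """Targets for the output layer and the two hidden layers."""
    phi2_star = phi2_t                          # output target (eq. 3.38)
    phi1_star = phi1_t + alpha1_t - alpha1_f    # second hidden layer (eq. 3.39)
    phi0_star = phi0_t + alpha0_t - alpha0_f    # first hidden layer (eq. 3.40)
    return phi2_star, phi1_star, phi0_star
```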

3.6.5 Learning rate optimization

For each of the three network sizes that we present in this paper, a grid search was performed in order to find good learning rates. We set the learning rate for each layer by stepping through the range [0.1, 0.3] with a step size of 0.02. For each combination of learning rates, a neural network was trained for one epoch on the 60,000 training examples, after which the network was tested on 10,000 test images. The learning rates that gave the best performance on the test set after an epoch of training were used as a basis for a second grid search around these learning rates that used a smaller step size of 0.01. From this, the learning rates that gave the best test performance after 20 epochs were chosen as our learning rates for that network size. In all of our simulations, we used a learning rate of 0.19 for a network with no hidden layers, learning rates of 0.21 (output and hidden) for a network with one hidden layer, and learning rates of 0.23 (hidden layers) and 0.12 (output layer) for a network with two hidden layers. All networks with one hidden layer had 500 hidden layer neurons, and all networks with two hidden layers had 500 neurons in the first hidden layer and 100 neurons in the second hidden layer.
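The two-stage search could be implemented as in the sketch below; train_and_test is a hypothetical stand-in for a function that trains a network with the given per-layer learning rates and returns its test error:

```python
import itertools
import numpy as np

def grid_search(train_and_test, n_layers, lo=0.1, hi=0.3, step=0.02, epochs=1):
    """Exhaustive search over per-layer learning rate combinations."""
    rates = np.arange(lo, hi + 1e-9, step)
    best_combo, best_err = None, np.inf
    for combo in itertools.product(rates, repeat=n_layers):
        err = train_and_test(learning_rates=combo, epochs=epochs)
        if err < best_err:
            best_combo, best_err = combo, err
    return best_combo   # then repeat with step=0.01 centered on this result
```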

3.6.6 Training paradigm

For all simulations described in this paper, the neural networks were trained on classifying handwritten digits using the MNIST database of 28 pixel × 28 pixel images. Initial feedforward and feedback weights were chosen randomly from a uniform distribution over a range that was calculated to produce voltages in the dendrites between −6 V and 12 V.

Prior to training, we tested a network's initial performance on a set of 10,000 test examples. This set of images was shuffled at the beginning of testing, and each example was shown to the network in sequence. Each input image was encoded into Poisson spiking activity of the 784 input neurons representing each pixel of the image. The firing rate of an input neuron was proportional to the brightness of the pixel that it represents (with spike rates between 0 and $\phi_{\max}$). The spiking activity of each of the 784 input neurons was received by the neurons in the first hidden layer. For each test image, the network underwent only a forward phase. At the end of this phase, the network's classification of the input image was given by the neuron in the output layer with the greatest somatic potential (and therefore the greatest spike rate). The network's classification was compared to the target classification. After classifying all 10,000 testing examples, the network's classification error was given by the percentage of examples that it did not classify correctly.

Following the initial test, training of the neural network was done in an on-line fashion. All 60,000 training images were randomly shuffled at the start of each training epoch. The network was then shown each training image in sequence, undergoing a forward phase ending with a plateau potential, and a target phase ending with another plateau potential. All feedforward weights were then updated at the end of the target phase. At the end of the epoch (after all 60,000 images were shown to the network), the network was again tested on the 10,000 test examples. The network was trained for up to 60 epochs.
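The test-time decision rule reduces to an argmax over the output layer's somatic potentials, as in the sketch below; network_forward_phase is a hypothetical function standing in for running the forward dynamics on a single image:

```python
import numpy as np

def classify(network_forward_phase, image, duration=500):
    """Run a forward phase of `duration` ms and return the predicted digit:
    the output unit with the greatest average somatic potential (and hence
    the greatest spike rate)."""
    V1_avg = network_forward_phase(image, duration)
    return int(np.argmax(V1_avg))
```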

3.6.7 Simulation details

For each training example, a minimum length of 50 ms was used for each of the forward and target phases. The lengths of the forward and target training phases were determined by adding their minimum length to an extra length term, which was chosen randomly from a Wald distribution with a mean of 2 ms and scale factor of 1. During testing, a fixed length of 500 ms was used for the forward transmit phase.

Average forward and target phase voltages were calculated after a settle duration of $\Delta t_s = 30$ ms from the start of the phase. For simulations with randomly sampled plateau potential times (Figure A.1), the time at which each neuron's plateau potential occurred was randomly sampled from a folded normal distribution ($\mu = 0$, $\sigma^2 = 3$) that was truncated (max = 5) such that plateau potentials occurred between 0 ms and 5 ms before the start of the next phase. In this scenario, the apical voltage over the last 30 ms was averaged in the calculation of the plateau potential for a particular neuron.

The time-step used for simulations was dt = 1 ms. At each time-step, the network's state was updated bottom-to-top beginning with the first hidden layer and ending with the output layer. For each layer, dendritic potentials were updated, followed by somatic potentials, and finally their spiking activity. Table A.1 lists the simulation parameters and the values that were used in the figures presented.

All code was written using the Python programming language version 2.7 (RRID: SCR 008394) with the NumPy (RRID: SCR 008633) and SciPy (RRID: SCR 008058) libraries. The code is open source and is freely available at https://github.com/jordan-g/Segregated-Dendrite-Deep-Learning [169]. The data used to train the network was from the Modified National Institute of Standards and Technology (MNIST) database, which is a modification of the original database from the National Institute of Standards and Technology (RRID: SCR 006440) [160]. The MNIST database can be found at http://yann.lecun.com/exdb/mnist/. Some of the simulations were run on the SciNet High-Performance Computing platform [170].

3.7 Acknowledgments

We would like to thank Douglas Tweed, João Sacramento, and Yoshua Bengio for helpful discussions on this work. This research was supported by three grants to B.A.R.: a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2014-04947), a 2016 Google Faculty Research Award, and a Fellowship with the Canadian Institute for Advanced Research. The authors declare no competing financial interests. Some simulations were performed on the gpc supercomputer at the SciNet HPC Consortium. SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; Ontario Research Fund - Research Excellence; and the University of Toronto.

Chapter 4

Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits

Alexandre Payeur1•, Jordan Guerguiev2,3•, Friedemann Zenke4, Blake A. Richards5,6,7,8∗•, and Richard Naud1,9,∗•

1 uOttawa Brain and Mind Institute, Centre for Neural Dynamics, Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, ON, Canada
2 Department of Biological Sciences, University of Toronto Scarborough, Toronto, ON, Canada
3 Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada
4 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
5 Mila, Montréal, QC, Canada
6 Department of Neurology and Neurosurgery, McGill University, Montréal, QC, Canada
7 School of Computer Science, McGill University, Montréal, QC, Canada
8 Learning in Machines and Brains Program, Canadian Institute for Advanced Research, Toronto, ON, Canada
9 Department of Physics, University of Ottawa, Ottawa, ON, Canada
∗ Corresponding authors, email: [email protected], [email protected]

• Equal contributions

This chapter consists of a manuscript that has been accepted for publication in Nature Neuroscience, and is available as a preprint on bioRxiv [44].

While the full work is presented here for completeness, note that the rate-based model and simulations (presented in Sections 4.4.4, 4.4.5 and 4.6.2, and Figs. 4.5 and 4.6) were done by Jordan Guerguiev and are the focus of this thesis, while the spike-based model and simulations (presented in Sections 4.4.1-4.4.4 and 4.6.1, and Figs. 4.2-4.4) were implemented by Alexandre Payeur.


4.1 Abstract

Synaptic plasticity is believed to be a key physiological mechanism for learning. It is well-established that it depends on pre- and postsynaptic activity. However, models that rely solely on pre- and postsynaptic activity for synaptic changes have, to date, not been able to account for learning complex tasks that demand credit assignment in hierarchical networks. Here, we show that if synaptic plasticity is regulated by high-frequency bursts of spikes, then neurons higher in a hierarchical circuit can coordinate the plasticity of lower-level connections. Using simulations and mathematical analyses, we demonstrate that, when paired with short-term synaptic dynamics, regenerative activity in the apical dendrites, and synaptic plasticity in feedback pathways, a burst-dependent learning rule can solve challenging tasks that require deep network architectures. Our results demonstrate that well-known properties of dendrites, synapses, and synaptic plasticity are sufficient to enable sophisticated learning in hierarchical circuits.

4.2 Author contributions

All authors contributed to the development of the burst-dependent learning rule. Jordan Guerguiev developed the rate-based model as well as the learning rules for recurrent and feedback weights, and performed all experiments involving training this model on MNIST, CIFAR-10 and ImageNet tasks. Alexandre Payeur developed the spiking model and performed all experiments involving this model. He also contributed to writing the manuscript. Friedemann Zenke served as an advisor and assisted with writing the manuscript. Blake A. Richards is the thesis supervisor for JG and provided guidance and support for the development of the models and experiments. He also contributed to writing the manuscript. Richard Naud developed the burst multiplexing framework and contributed to model development and validation, as well as writing of the manuscript.

4.3 Introduction

The current canonical model of synaptic plasticity in the cortex is based on the co-occurrence of activity on the two sides of the synapse, pre and postsynaptic [171, 7]. The occurrence of either long-term depression (LTD) or long-term potentiation (LTP) is controlled by specific features of pre and postsynaptic activity [172, 173, 174, 68, 83, 17, 67, 175, 113, 69] and a more global state of neuromodulation [176, 177, 178, 179, 180, 181, 182, 183, 184, 7]. However, local learning rules by themselves do not provide a guarantee that behavioral metrics will improve. With neuromodulation driven by an external reward/punishment mechanism, this guarantee is achievable [185]. But such learning is very slow in tasks that require large or deep networks, because a global signal provides very limited information to neurons deep in the hierarchy [28, 186, 187]. Thus, an outstanding question is (Fig. 4.1): how can neurons high up in a hierarchy signal to other neurons — sometimes multiple synapses lower — whether to engage in LTP or LTD in order to improve behavior [7]? This question is sometimes referred to as the “credit assignment problem”: essentially, how can we assign credit for any errors or successes to neurons that are multiple synapses away from the output [27]?

In machine learning, the credit assignment problem is typically solved with the backpropagation-of-error algorithm (backprop [188]), which explicitly uses gradient information in a biologically implausible manner [187] to calculate synaptic weight updates. Many previous studies have attempted to capture the credit assignment properties of backprop with more biologically plausible implementations, in the hope that a biological model could match backprop’s learning performance [134, 110, 114, 100, 104, 103, 46, 41, 189, 101, 187, 190, 191, 43, 42, 192, 193, 109]. However, a problem with most of these models is that there is always an implicit assumption that during some phases of learning no sensory stimuli are processed, i.e. the models are not “online” in their learning, which is problematic both for biological plausibility and for potential future development of low-energy neuromorphic computing devices. Moreover, several well-established properties of real neurons, including nonlinearities in the apical dendrites [60], short-term synaptic plasticity (STP) [194, 195], and inhibitory microcircuits, are ignored. None of the previous studies successfully incorporated all of these features to perform online credit assignment (Table B.1). Furthermore, none of these models captured the frequency dependence of synaptic plasticity, which is a very well-established property of LTP/LTD [68, 196, 17, 197, 67].

As established in non-hierarchical systems, such as the electrosensory lateral line lobe of the electric fish [198, 199, 200] or the cerebellum [201], feedback connections on dendrites are well-poised to orchestrate learning [202]. But for credit assignment in hierarchical networks, these connections should obey four constraints: 1) Feedback must steer the sign and magnitude of plasticity. 2) Feedback signals from higher-order areas should be multiplexed with feedforward signals from lower-order areas so that credit information can percolate down the hierarchy with minimal disruption to sensory information. 3) There should be some degree of alignment between feedback connections and feedforward connections. 4) Integration of credit-carrying feedback signals should be close to linear and avoid saturation (i.e., feedback signals should be linear with respect to any credit information). Experimental and theoretical work has addressed steering [203, 69], multiplexing [204, 205, 206, 115], alignment [166, 167, 46, 42] or linearity [207] in isolation. But it remains unclear whether a single set of cellular and subcellular mechanisms can address all four requirements for orchestrating learning in cortical hierarchies efficiently.

Here, we address the credit assignment problem with a spike-based learning rule that models how high-frequency bursts determine the sign of synaptic plasticity [68, 196, 17, 197, 67]. Guided by the underlying philosophy first espoused by the work of Körding and König (2001) [134] that the unique properties of pyramidal neurons may contain a solution to biologically plausible credit assignment, we show that combining properties of apical dendrites [60] with our burst-dependent learning rule allows feedback to steer plasticity. We further show that feedback information can be multiplexed across multiple levels of a hierarchy when feedforward and feedback connections have distinct STP [208, 209]. Using spiking simulations, we demonstrate that these mechanisms can be used to coordinate learning across a hierarchical circuit in a fully online manner. We also show that a coarse-grained equivalent of these dynamical properties will, on average, lead to learning that approximates loss-function gradients as used in backprop. We further show that this biological approximation to loss-function gradients is improved by a burst-dependent learning rule performing the alignment of feedback weights with feedforward weights, as well as recurrent inhibitory connections that linearize credit signals. Finally, we show that networks trained with these mechanisms can learn to classify complex image patterns with high accuracy. Altogether, our work highlights that well-known properties of dendritic excitability, synaptic transmission, short-term synaptic plasticity, inhibitory microcircuits, and burst-dependent synaptic plasticity are sufficient to solve the credit assignment problem in hierarchical networks.

Figure 4.1: The credit assignment problem for hierarchical networks. (a) Illustration of a hierarchical neural network with feedforward and feedback connections. (b) For an orchestration of learning in this network, the representations in higher-level neurons should steer the plasticity of connections at a lower level.

4.4 Results

4.4.1 A burst-dependent rule enables top-down steering of plasticity

Experimental work has demonstrated that the sign of plasticity can be determined by patterns of pre and postsynaptic activity. The most common formulation of this is spike-timing-dependent plasticity (STDP), wherein the timing of pre and postsynaptic spikes is what determines whether LTP or LTD occurs [210, 173, 211]. However, there is also evidence suggesting that in many circuits, particularly mature ones [212], the principal determinant of plasticity is the level of postsynaptic depolarization, with large depolarization leading to LTP and small depolarization leading to LTD [172, 68, 174, 113, 83], which is a direct consequence of the dynamics of N-methyl-D-aspartate receptor (NMDAR)-dependent calcium influx [213]. Importantly, one of the easiest ways to induce large magnitude depolarization in dendrites is via backpropagation of high-frequency bursts of action potentials [157] and, therefore, the degree of postsynaptic bursting controls plasticity [68, 83, 17, 197, 67]. Since bursting may be modulated by feedback synapses on apical dendrites [60, 214], feedback could control plasticity in the basal dendrites via control of bursting. Thus, in considering potential mechanisms for credit assignment during top-down supervised learning, the burst-dependence of synaptic plasticity appears to be a natural starting point.

To explore how high-frequency bursting could control learning in biological neural networks, we formulated a burst-dependent plasticity rule as an abstraction of the experimental data. We consider a burst to be any occurrence of at least two spikes with a short (i.e. under 16 ms) interspike interval. Following Ref. [115], we further define an event as either an isolated single spike or a burst. Thus, for a given neuron’s output, there is an event train (similar to a spike train, except that events can be either bursts or single spikes) and a burst train, which comprises a subset of the events (see Methods). We note that these definitions impose a ceiling on the frequency of events of 62.5 Hz, which is well above the typical firing frequency of cortical pyramidal neurons [214, 215]. The learning rule states that the change over time of a synaptic weight between postsynaptic neuron $i$ and presynaptic neuron $j$, $dw_{ij}/dt$, results from a combination of an eligibility trace of presynaptic activity, $\tilde{E}_j$, and the potentiating (or depressing) effect of the burst train $B_i$ (or event train $E_i$) of the postsynaptic cell (Fig. 4.2a):

Figure 4.2: Burst-dependent plasticity rule. (a) Schematics of the learning rule. When there is a presynaptic eligibility trace, the occurrence of a postsynaptic burst leads to potentiation (top) whereas an isolated postsynaptic spike leads to depression of the synapse (bottom). (b-d) Net weight change for different pairing protocols. (b) The periodic protocol consisted of 15 sequences of 5 pairings, separated by a 10 s interval. We used pairings with $t_{\text{post}} = t_{\text{pre}}$. (c) For the Poisson protocol, the pre and postsynaptic activities were Poisson spike trains with equal rates. The protocol was repeated with different initial time-averaged burst probabilities ($\bar{P}$). (d) For the burst-Poisson protocol, pre and postsynaptic Poisson events were generated at a fixed rate (ER). For each event, a burst was produced with a probability that varied from 0 to 50%. (e-g) Impact of distal inputs on burst probability and feedforward synaptic weights for constant presynaptic event rate. Positive distal input (90–140 s) increases burst probability (e) and strengthens feedforward synapses (f). Negative distal input (190–240 s) decreases burst probability and weakens synapses. A dendritic input to the presynaptic neuron (290–340 s) increases its burst probability and mildly affects its event rate (g), but does not significantly change the weights (f). (e) Event rate (ER; blue), burst probability (BP; solid red curve) and estimated BP (dashed red curve) for the postsynaptic population. The black dotted line indicates the prestimulation ER and serves as a reference for the variations of the ER with plasticity. (f) Weight change relative to the initial average value of the weights. (g) Same as panel e, but for the presynaptic population. For the schematic on the right-hand side, black and grey axonal terminals onto the presynaptic (green) population represent Poisson input noise; such noise is absent for the postsynaptic (light blue) population in this simulation.

$$\frac{dw_{ij}}{dt} = \eta\left[B_i(t) - \bar{P}_i(t)\,E_i(t)\right]\tilde{E}_j(t). \tag{4.1}$$

The constant $\eta$ is the learning rate. The variable $\bar{P}_i \in [0, 1]$ is an exponential moving average of the proportion of events that are bursts in postsynaptic neuron $i$, with a slow ($\sim$1–10 s) time constant (see Methods). When a postsynaptic event that is not a burst occurs, the weight changes by an amount proportional to $-\bar{P}_i(t)\tilde{E}_j(t) < 0$, i.e. it decreases. In contrast, if a postsynaptic event is a burst, then the weight changes by an amount proportional to $[1 - \bar{P}_i(t)]\tilde{E}_j(t) > 0$, i.e. it increases. Hence, this moving average regulates the relative strength of burst-triggered potentiation and event-triggered depression, and can also be implemented by changes in the thresholds controlling how NMDAR-dependent calcium influx translates into either LTD or LTP [213]. It has been well established that such mechanisms exist in real neurons [216, 217].
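To make the rule concrete, the following is a minimal Python sketch of an event-driven discretization of Eq. 4.1. The function name, the parameter values, and the use of 0/1 indicators for the event and burst trains are our own illustrative choices; the actual spiking simulations used the Auryn simulator (see Methods), and a real burst registers its event and burst deltas at slightly different times.

```python
def burst_rule_step(w, burst, event, p_bar, e_pre_trace, eta=0.1):
    """One event-driven update of Eq. 4.1: dw/dt = eta * [B_i - P_bar_i * E_i] * E_tilde_j.

    burst, event : 0/1 indicators for the postsynaptic burst and event trains
    p_bar        : slow moving average of the postsynaptic burst probability
    e_pre_trace  : presynaptic eligibility trace (E_tilde_j)
    """
    return w + eta * (burst - p_bar * event) * e_pre_trace

# With p_bar = 0.2, an isolated postsynaptic spike (event without burst)
# depresses the synapse, whereas a burst (event plus burst) potentiates it:
w = 0.5
print(burst_rule_step(w, burst=0, event=1, p_bar=0.2, e_pre_trace=1.0))  # 0.48 (LTD)
print(burst_rule_step(w, burst=1, event=1, p_bar=0.2, e_pre_trace=1.0))  # 0.58 (LTP)
```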

The plasticity rule stipulates that when a presynaptic input is paired with a postsynaptic burst, LTP is induced, and otherwise, LTD results (Fig. 4.2a) [212, 196, 17, 197, 67, 213]. Using this rule, we simulated a series of synaptic plasticity experiments from the experimental and computational literature. First, we examined a frequency-dependent STDP protocol [83]. We found that when the spike pairing frequency is low, LTD is produced, and when the pairing frequency is high, LTP is produced (Fig. 4.2b). This matches previous reports on frequency-dependent STDP and shows that a burst-dependent synaptic plasticity rule can explain these data. Then, we explored the behavior of our rule when the pre and postsynaptic neurons fire independently according to Poisson statistics [218] (Fig. 4.2c). Experimental results have established that in such a situation the postsynaptic firing rate should determine the sign of plasticity [83]. As in similar learning rules [218], we found that a burst-dependent plasticity rule produces exactly this behavior (Fig. 4.2c), but with a dependence on bursting history not typically explored. Notably, contrary to the Bienenstock-Cooper-Munro (BCM) model [219], where the switching point between LTD and LTP depends on a nonlinear moving average of the feedforward activity, in the present case the adaptive threshold is a burst probability, which can be controlled independently of the feedforward activity. These results demonstrate that a burst-dependent plasticity rule is capable of uniting a series of known experimental and theoretical results.

The burst-dependent rule suggests that feedback-mediated steering of plasticity could be achieved if there were a mechanism for top-down control of the likelihood of a postsynaptic burst. To illustrate this, in Fig. 4.2d we simulated another protocol wherein events were generated with Poisson statistics, and each event could become a burst with probability $P$ (x axis in Fig. 4.2d). Manipulating this burst probability against the initial burst probability estimate ($\bar{P}_i(0) = 20\%$) controlled the occurrence of LTP and LTD, while changing the pre and postsynaptic event rates simply modified the rate of change of the weight (but not the transition point between LTP and LTD). This shows that one way for neurons to control the sign of plasticity to ensure effective learning may be to regulate the probability of high-frequency bursts. Interestingly, evidence suggests that in cortical pyramidal neurons of sensory cortices the probability of generating high-frequency bursts is controlled by inputs to the distal apical dendrites and their activation of voltage-gated calcium channels (VGCCs) [60, 220, 221, 222, 214]. Anatomical and functional data have shown that these inputs often come from higher-order cortical or thalamic regions [223, 224].

We wondered whether combining a burst-dependent plasticity rule with regenerative activity in apical dendrites could permit top-down signals to act as a “teaching signal”, instructing the sign of plasticity in a neuron. To explore this, we ran simulations of pyramidal neuron models with simplified VGCC kinetics in the apical dendrites (see Methods). We found that by manipulating the distal inputs to the apical dendrites we could control the number of events and bursts in the neurons independently (Figs. 4.2e, g). Importantly, the inputs to the apical dendrites in the postsynaptic neurons were what regulated the number of bursts, and this also controlled changes in the synaptic weights, through the burst-dependent learning rule. When the relative proportion of bursts increased, the synaptic weights potentiated on average, and when the relative proportion of bursts decreased, the synaptic weights depressed (Fig. 4.2f). Thus, in Fig. 4.2f, the weight increases (decreases) on average when $P - \bar{P}$ is positive (negative). Modifying the proportion of bursts in the presynaptic neurons had little effect on the weights (see the rightmost gray shaded area in Fig. 4.2e-g). The sign of plasticity was independent of the number of events, though the magnitude was not. Therefore, while the number of events contributed to the determination of the magnitude of changes, the top-down inputs to the apical dendrites controlled the sign of plasticity. In this way, the top-down inputs acted as a “teaching signal” that determined whether LTP or LTD would occur. These results show that a burst-dependent learning rule paired with the control of bursting provided by apical dendrites enables a form of top-down steering of synaptic plasticity in an online, local, and spike-based manner.

4.4.2 Dendrite-dependent bursting combined with short-term plasticity supports multiplexing of feedforward and feedback signals

The question that naturally arises from our finding that top-down inputs can steer synaptic plasticity via a burst-dependent rule is whether feedback can steer plasticity without affecting the communication of bottom-up signals. Using numerical simulations, we previously demonstrated that in an ensemble of pyramidal neurons the inputs to the perisomatic and distal apical dendritic regions can be distinctly encoded using the event rate computed across the ensemble of cells and the percentage of events in the ensemble that are bursts (the “burst probability”), respectively [115]. When communicated by synapses with either short-term facilitation (STF) or short-term depression (STD), this form of “ensemble multiplexing” may allow top-down and bottom-up signals to be simultaneously transmitted through a hierarchy of pyramidal neurons.

To explore this possibility, we conducted simulations of two reciprocally connected ensembles of pyramidal neurons along with interneurons providing feedforward inhibition. One ensemble received currents in the perisomatic region and projected to the perisomatic region of the other ensemble (Fig. 4.3a, green ensemble). The other ensemble (Fig. 4.3a, light blue) received currents in the distal apical compartment and projected to the distal apical compartment of the first ensemble. As such, we considered the first ensemble to be “lower” (receiving and communicating bottom-up signals), and the other to be “higher” (receiving and communicating top-down signals) in the hierarchy. Furthermore, we made one key assumption in these simulations: we assumed that the synapses in the perisomatic regions were short-term depressing, whereas those in the distal apical dendrites were short-term facilitating. Additionally, we assumed that the inhibitory interneurons targeting the perisomatic region possessed STD synapses, and the inhibitory interneurons targeting the distal apical dendrites possessed STF synapses. These properties are congruent with what is known about parvalbumin-positive and somatostatin-positive interneurons [194, 195, 225], which target the perisomatic and apical dendritic regions, respectively.

In these simulations, we observed that currents injected into the lower ensemble’s perisomatic compartments were reflected in the event rate of those neurons (Fig. 4.3c3), though with a slight phase lead due to spike frequency adaptation. In contrast, the currents injected into the distal apical dendrites of the higher ensemble were reflected in the burst probability of those neurons (Fig. 4.3b2). Importantly, though, we also observed that these signals were simultaneously propagated up and down. Specifically, the input to the lower ensemble’s perisomatic compartments was also encoded by the higher ensemble’s event rate (Fig. 4.3b3), whereas the burst rate of the higher ensemble was encoded by the lower ensemble’s burst probability (Fig. 4.3c2). In this way, the lower ensemble had access to a conjunction of the signal transmitted to the higher ensemble’s distal apical dendrites, as well as the higher ensemble’s event rate (see arrow highlighting amplitude modulation in Fig. 4.3c2). Thus, since the higher ensemble’s event rate is modulated by the lower ensemble’s event rate, the burst rate ultimately contains information about both the top-down and the bottom-up signals (Fig. 4.3d).
Notably, this is important for credit assignment, as credit signals ideally are scaled by the degree to which a neuron is involved in processing a stimulus (this happens in backprop, for example).

These simulations demonstrate that if bottom-up connections to perisomatic regions and perisomatic inhibition rely on STD synapses, while top-down connections to apical dendrites and distal dendritic inhibition utilize STF synapses, then ensembles of pyramidal neurons are capable of simultaneously processing both a top-down signal and a bottom-up signal using a combination of event rates, burst rates, and burst probabilities. We conclude that with the appropriate organization of short-term synaptic plasticity mechanisms, a top-down signal to apical dendrites can 1) control the sign of plasticity locally (steering; Fig. 4.2a), 2) be communicated to lower ensembles without affecting the flow of bottom-up information (multiplexing; Fig. 4.3), and 3) be combined with bottom-up signals appropriately for credit assignment.
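To give intuition for why this arrangement works, the sketch below implements a standard Tsodyks-Markram-style model of short-term plasticity; the parameter values are our own illustrative choices, not those of the spiking simulations (see Methods). With depressing parameters the synapse mostly transmits the first spike of a burst (the event), while with facilitating parameters the response grows within the burst, so STD synapses preferentially carry event rates and STF synapses preferentially carry bursts.

```python
import numpy as np

def stp_release(spike_times, U, tau_f, tau_d):
    """Tsodyks-Markram-style short-term plasticity: returns the relative
    synaptic efficacy u*x at each presynaptic spike time."""
    u, x, last_t = U, 1.0, None
    eff = []
    for t in spike_times:
        if last_t is not None:
            dt = t - last_t
            u = U + (u - U) * np.exp(-dt / tau_f)      # facilitation decays to U
            x = 1.0 + (x - 1.0) * np.exp(-dt / tau_d)  # resources recover to 1
        u = u + U * (1.0 - u)   # facilitation jump at the spike
        eff.append(u * x)       # release proportional to u*x
        x = x * (1.0 - u)       # depletion of resources by the release
        last_t = t
    return np.array(eff)

# A burst of three spikes, 8 ms apart: with high U and fast-decaying
# facilitation the response depresses within the burst (STD), whereas with
# low U and slow facilitation it grows within the burst (STF).
burst = np.array([0.0, 0.008, 0.016])
print(stp_release(burst, U=0.5, tau_f=0.02, tau_d=0.5))   # STD: decreasing
print(stp_release(burst, U=0.05, tau_f=0.5, tau_d=0.1))   # STF: increasing
```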

4.4.3 Combining a burst-dependent plasticity rule with short-term plasticity and apical dendrites can solve the credit assignment problem

To test whether STP, dendrite-dependent bursting and a burst-dependent learning rule can act simultaneously in a hierarchy to support learning, we built a simulation of ensembles of pyramidal neurons arranged in three layers, with two ensembles of cells at the input, one ensemble of cells at the output, and two ensembles of cells in the middle (the “hidden” layer; Fig. 4.4a). The distal dendrites of the top ensemble received “teaching” signals indicating desired or undesired outputs. No other teaching signal was provided to the network. As such, the hidden layer ensembles were informed of the suitability of their output only via the signals they received from the output ensemble’s bursts. Currents injected into the somatic compartments of the input layer populations controlled their activity levels in accordance with the learning task to be discussed below. Compared to Figs. 4.2-4.3, for this simulation we made a few modifications to synaptic transmission and pyramidal neuron dynamics to streamline the burst-event multiplexing and decoding (see Methods). The most important addition, however, was that we modified the learning rule in Eq. 4.1 by multiplying the right-hand side by an additional global term, M(t), that gates plasticity. This term abstracts a number of possible sources of control of plasticity, like dendritic inhibition [207, 226, 214], or disinhibition through vasoactive intestinal peptide (VIP)-positive cells [227], burst sizes [228, 213] or transient neuromodulation [229, 177, 230]. Importantly, M(t) in our model gates plasticity without changing its sign, contrary to some models on the role of neuromodulation in plasticity [184]. Its role was to make sure that plasticity elicited by the abrupt onset and offset of each training example does not overcome the plasticity elicited by the teaching signal, i.e. it was used to ensure a supervised training regime. We accomplished this by setting M = 0 when no teaching signal was present at the output layer and M = 1 under supervision. In this way, we ensured that the teaching signal was the primary driver of plasticity.

We trained our 3-layer network on the exclusive or (XOR) task, wherein the network must respond with a high output if only one input pool is active, and a low output if neither or both input pools are active (Fig. 4.4a-b). We chose XOR as a canonical example of a task that requires a nonlinear hierarchy with appropriate credit assignment for successful learning. Before learning, the network was initialized such that the output pool treated any input combination as roughly equivalent (Fig. 4.4c, dashed line). To compute XOR, the output pool would have to learn to reduce its response to simultaneously active inputs and increase its response to a single active input.

We set up the network configuration to address a twofold question: (1) Would an error signal applied to the top-layer neurons’ dendrites be propagated downward adequately? (2) Would the burst-dependent learning rule combine top-down signals with bottom-up information to make the hidden-layer neurons better feature detectors for solving XOR?

Figure 4.3: Dendrite-dependent bursting combined with short-term plasticity supports the simultaneous propagation of bottom-up and top-down signals. (a) Schematic of the network. Lower-level pyramidal neurons (green) received a somatic current Is and projected with STD synapses to the somatic compartments of both a higher-level pyramidal neuron population (light blue) and to a population providing disynaptic inhibition (grey discs). The dendritic compartments of the light blue population received a current Id. The light blue neurons innervated with STF synapses both the dendritic compartments of the green pyramidal neurons and a population providing disynaptic inhibition (grey squares). (b1, c1) Raster plots of 25 out of the 4000 neurons per pyramidal population for the light blue (b1) and green (c1) populations. Blue ticks show the start of an event, being either a burst or an isolated spike. Orange ticks are the second spike in a burst; the remaining spikes in a burst are not shown. The corresponding population event rates (blue lines) and burst rates (orange lines) are superimposed. (b2-b3) Encoding performed by the light blue ensemble (pop 2). Its burst probability (b2, dotted red line) reflects the applied dendritic current Id (dashed black line), whereas its event rate (b3, dotted blue line) reflects the event rate of the green population (solid blue line). (c2-c3) Encoding performed by the green ensemble (pop 1). Its burst probability (c2, solid red line) reflects the burst rate (dotted orange line) of the light blue ensemble, whereas its event rate (solid blue line) reflects the applied somatic current Is (dashed black line). Arrow highlights amplitude modulation arising from the conjunction of top-down and bottom-up inputs. Results are displayed as mean ± 2SD over five realizations of the Poisson noise applied to all neurons in the network. In each panel, the encoded input signal has been linearly rescaled so that its range matches that of the output. For clarity, the encoded signals in panels b3 and c2 are displayed using their averages only (i.e., without the standard deviations). The bin size used in the population averages was 50 ms. The legend box applies to panels b1 to d inclusively. (d) Schematic illustrating information propagation in the network. Chapter 4. Project 2: Burst-dependent synaptic plasticity for credit assignment 63

Figure 4.4: Burst-dependent plasticity can solve the credit assignment problem for the XOR task. (a) Each neuron population contained 500 pyramidal neurons. Feedforward connections transmitted events, while feedback connections transmitted bursts. The teacher (pink arrow) was applied by injecting a hyperpolarizing current into the output ensemble’s dendrites if their event rate was high in the presence of inputs that are either both active or both inactive. A depolarizing current was injected into the output ensemble’s dendrites if their event rate was low when only one of the inputs was active. The activity of the input populations was controlled by somatic current injections (grey arrows). The ⊕ and ⊖ symbols represent the initialization of the feedback synaptic weights as mainly excitatory or inhibitory. (b) Input layer event rates (ERs) for the four input conditions presented sequentially in time. The duration of each example was 8 s. (c) Output ER before and after learning. The output ensemble acquired strong firing (event rate above the dotted line) at the input conditions associated with “true” in XOR. Results are displayed as mean ± 2SD over 5 random initializations of the single-neuron connectivity. In other panels, a single realization is displayed for clarity. (d) During learning, the dendritic input (dashed pink) applied to the output ensemble’s neurons controlled their burst probability in the last two seconds of the input condition. (e1-e2) During learning, the burst rate (BR) at the output layer is encoded into the BP of the hidden layer to propagate the error. For the hidden-2 population, this inherited credit is inverted with respect to that in the hidden-1 population. (f1-f2) After (full line) vs. before (dashed line) learning for the hidden layer. The ER decreased in hidden-1 but increased in hidden-2. The bin size used in the population averages was 0.4 s.

Importantly, if the answer to these two questions were ‘yes’, we would expect that the two hidden ensembles would learn different features if they receive different feedback from the output. To test this, we provided hidden pool 1 with positive feedback from the output, and hidden pool 2 with negative feedback (Fig. 4.4a, light blue symbols). With this configuration, adequate error propagation to the two hidden pools would make their responses diverge with learning, and the output pool would learn to take advantage of this change. Indeed, our results showed that the XOR task was solved in this manner after training (Fig. 4.4c, solid line).

To understand how this solution was aided by appropriate credit assignment, we examined the information about the top-down teaching signals in each layer. According to the learning rule, plasticity can be steered by controlling the instantaneous propensity to burst with respect to a moving average of the burst probability (see term $B_i - \bar{P}_i E_i$ in Eq. 4.1 and Fig. 4.2e-f). In the output pool, the error signal applied to the apical dendrites induced a temporary decrease in the burst probability when the input pools were both active or both inactive, and a temporary increase when only one input pool was active (Fig. 4.4d). These changes in the output burst probability modified the output burst rate, which was propagated to the hidden pools. As mentioned above, the hidden pools received top-down signals with different signs (Fig. 4.4e1-2, orange lines), and indeed their respective burst probabilities were altered in opposite directions (Fig. 4.4e1-2, red lines). Due to these distinct top-down signals and the adaptive threshold $\bar{P}_i$, the hidden pools’ responses diverged during learning (Fig. 4.4f1-2). For instance, hidden pool 1 reduced its response to both inputs being active, while hidden pool 2 increased its response. These changes were due to the top-down control of the plasticity of synapses between the input and hidden pools. We verified that solving this task depends on the plasticity of connections from input to hidden units, but only weakly on the size of the ensembles (Fig. B.1). Also, we verified that the task was solved when the time constant $\tau_{\text{avg}}$ was shorter (Fig. B.2), and when the feedback pathways had the same sign of connection (Fig. B.3). These results demonstrate that the propagation of errors using burst-multiplexing and the burst-dependent learning rule can combine to achieve hierarchical credit assignment in ensembles of pyramidal neurons.
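For illustration, the teaching signal used in this task can be summarized by the sketch below. The threshold value and current amplitude are hypothetical placeholders, since the text above only specifies the sign of the injected dendritic current for each input condition.

```python
def xor_teacher_current(in1_active, in2_active, output_event_rate,
                        er_threshold=5.0, amplitude=1.0):
    """Sign of the dendritic teaching current applied to the output ensemble
    (a sketch; er_threshold in Hz and amplitude are hypothetical). The target
    is 'true' when exactly one input pool is active."""
    target_true = in1_active != in2_active  # the XOR truth table
    if target_true and output_event_rate < er_threshold:
        return +amplitude   # depolarize the dendrites: output should fire more
    if not target_true and output_event_rate > er_threshold:
        return -amplitude   # hyperpolarize the dendrites: output should fire less
    return 0.0
```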

4.4.4 Burst-dependent plasticity promotes linearity and alignment of feedback

Having demonstrated that a burst-dependent learning rule in pyramidal neurons enables online, local, spike-based solutions to the credit assignment problem, we were interested in understanding the potential relationship between this algorithm and the gradient-descent-based algorithms used for credit assignment in machine learning. To do this, we wanted to derive the average behavior of the burst-dependent learning rule at the coarse-grained, ensemble level, and determine whether it provided an estimate of a loss-function gradient. More precisely, in the spirit of mean-field theory and linear-nonlinear rate models [231, 232, 233], we developed a model where each unit represents an ensemble of pyramidal neurons, with event rates, burst probabilities, and burst rates as described above (Fig. B.4). In this step, we lump together aspects of the microcircuitry, such as feedforward inhibition by parvalbumin-positive cells, which helps to linearize the transfer function of event rates and prevent bursting in the absence of apical inputs [234, 235]. Specifically, for an ensemble of pyramidal neurons, we defined e(t) and b(t) as ensemble averages of the event and burst trains, respectively. Correspondingly, p(t) = b(t)/e(t) refers to the ensemble-level burst probability. We then defined the connection weight between an ensemble of presynaptic neurons and an ensemble of postsynaptic neurons, $W_{\text{post,pre}}$, as the effective impact of the presynaptic ensemble on the postsynaptic ensemble, taking into consideration potential polysynaptic interactions. Note that this means that the ensemble-level weight, $W_{\text{post,pre}}$, can be either positive or negative, as it reflects the cumulative impact of both excitatory and inhibitory synapses (see Appendix B).

Our goal was then to derive the ensemble-level weight updates from the burst-dependent plasticity rule (Eq. 4.1). We assumed that any given pair of neurons was only weakly correlated on average, a reasonable assumption if the synaptic weights in the circuit are small [236]. Moreover, decorrelation between neurons is observed when animals are attending to a task [236], which suggests that this is a reasonable assumption for active processing states. We further assumed that the neuron-specific moving average burst probability ($\bar{P}_i$) is independent of the instantaneous occurrence of events. Using these assumptions, it can be shown (see Appendix B) that the effective weight averaged across both pre and postsynaptic ensembles obeys:

$$\frac{dW_{\text{post,pre}}}{dt} = \eta\, M(t)\left[p_{\text{post}}(t) - \bar{p}_{\text{post}}(t)\right] e_{\text{post}}(t)\, e_{\text{pre}}(t) \tag{4.2}$$

where the learning rate $\eta$ is different from that appearing in Eq. 4.1, and $\bar{p}_{\text{post}}(t)$ is a ratio of moving averages for the postsynaptic burst rate and event rate. This learning rule can be shown to implement an approximation of gradient descent for hierarchical circuits, like the backpropagation-of-error algorithm [123]. Specifically, if we assume that the burst probabilities remain in a linear regime (linearity), that the feedback synapses are symmetric to the feedforward synapses (alignment), and that error signals are received in the dendrites of the top-level ensembles, then $-\left[p_{\text{post}}(t) - \bar{p}_{\text{post}}(t)\right] e_{\text{post}}(t)$ is equivalent to the error signal sent backwards in backpropagation (see Appendix B). For the sake of computational efficiency, when simulating this ensemble-level learning, we utilized simplifications to the temporal dynamics (i.e. we implemented a discrete-time version of the rule), though the fundamental computations being implemented were identical to the continuous-time equation above (see Methods and Appendix B).

The assumptions of feedback linearity and alignment can be supported by the presence of additional learning mechanisms. First, we examined learning mechanisms to keep the burst probabilities in a linear regime. Multiple features of the microcircuit control linearity (Fig. B.5), including distal apical inhibition [114, 115], which is consistent with the action of somatostatin-positive Martinotti cells in cortical circuits [60, 207]. We used recurrent excitatory and inhibitory inputs to control the apical compartments’ potential (Fig. 4.5a). These dendrite-targeting inputs propagated bursts from neural ensembles at the same processing stage in the hierarchy, which provided them with the necessary information to keep the burst probabilities in a linear range of the burst probability function. We found that a simple homeostatic learning rule (see Methods) could learn to keep burst probabilities in a linear regime, thus improving gradient estimates (Fig. 4.5b).

Second, we explored potential mechanisms for learning weight symmetry. Symmetry between feedforward and feedback weights is an implicit assumption of many learning algorithms that approximate loss-function gradients. However, such an assumption is unnecessary, as it has been shown that it is possible to learn weight symmetry [167]. In one classic algorithm [106], weight symmetry is obtained if feedforward and feedback weights are updated with the same error signals, plus some weight decay [42]. In our model, this form of feedback weight update could be implemented locally, because the error signal used to update the feedforward weights in discrete time is the deviation of the burst rates from the moving average baseline, and this, we propose, also determines the updates to the feedback weights (see Methods). In brief, what this rule means in practice is that the apical dendrites would have a different learning rule than the basal dendrites, something that has been observed experimentally [17, 237]. As well, the specific learning rule used here assumes that the sign of plasticity at the feedback synapses is based on presynaptic bursts, rather than postsynaptic bursts. Whether such a phenomenon exists in real apical dendrites has, to our knowledge, not yet been examined.
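As a concrete illustration of this section’s two ensemble-level updates, here is a minimal discrete-time sketch in Python of Eq. 4.2 together with the feedback-weight update just described (same error term plus weight decay, in the spirit of [106, 42]). The vectorized form, the sign conventions, and the decay constant are our own illustrative choices; the actual simulations are described in Methods and Appendix B.

```python
import numpy as np

def ensemble_updates(W, Y, p_post, p_bar_post, e_post, e_pre,
                     M=1.0, eta=0.01, weight_decay=1e-3):
    """Discrete-time sketch of Eq. 4.2 plus the feedback update.

    W : feedforward ensemble weights (n_post x n_pre)
    Y : feedback ensemble weights (n_pre x n_post)
    p_post, p_bar_post, e_post : per-postsynaptic-ensemble burst probability,
        its moving average, and event rate; e_pre : presynaptic event rates
    """
    # Credit signal per postsynaptic ensemble: deviation of the burst
    # probability from its moving-average baseline, scaled by the event rate.
    delta = (p_post - p_bar_post) * e_post

    # Eq. 4.2, vectorized over the layer as an outer product.
    W = W + eta * M * np.outer(delta, e_pre)

    # Feedback weights receive the same error term (transposed), and both
    # matrices decay, which drives Y towards W^T over training.
    Y = Y + eta * M * np.outer(e_pre, delta) - weight_decay * Y
    W = W - weight_decay * W
    return W, Y
```

Because W and Y receive the same update up to transposition while both decay, their difference W - Y^T shrinks over training, which is what produces the alignment reported below.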
However, we note that there are many different potential algorithms for training feedback weights, and we selected this one largely because it has been shown to perform well in artificial neural networks [42, 45]. But our model is not tied to this specific algorithm, and the bursting system we have used here may in fact work well with other proposed weight alignment algorithms, such as weight mirroring or regression discontinuity design [42, 45]. Thus, we anticipate that this feedback learning rule could be updated in the future based on experimental findings. Here, it is a tool we used to determine whether the burst-dependent plasticity rule can learn challenging tasks if it is paired with a feedback learning rule that promotes weight alignment. Indeed, when we implemented this form of learning on the ensemble-level feedback weights, we observed rapid weight alignment (Fig. 4.5c and Fig. B.6) and convergence to a loss-function gradient (Fig. 4.5d). We note that our algorithm does not achieve perfect alignment, though it still manages to do better than a network that uses fixed feedback weights (Fig. 4.5c-d, red and yellow lines). Thus, some system for promoting weight alignment helps with gradient estimation, but it need not be perfect. Altogether, these results demonstrate that the burst-dependent learning rule, averaged across ensembles of pyramidal neurons, and paired with biologically plausible learning rules for recurrent inputs and feedback connections, can provide a good estimate of loss-function gradients in hierarchical networks.

4.4.5 Ensemble-level burst-dependent plasticity in deep networks can support good performance on standard machine learning benchmarks

We wanted to determine whether the ensemble-level learning rule could perform well on difficult tasks from machine learning that previous biologically plausible learning algorithms have been unable to solve. Specifically, we built a deep neural network composed of pyramidal ensemble units that formed a series of convolutional layers followed by fully-connected layers (Fig. 4.6a). We then trained these networks on two challenging image categorization datasets that previous biologically plausible algorithms have struggled with: CIFAR-10 and ImageNet [100]. The training in all components of the network used our burst-dependent plasticity rule, with recurrent inputs for linearization at fully-connected hidden layers.

For the CIFAR-10 dataset, we observed a classification test error rate of 20.1% after 400 epochs (where an epoch is a pass through all training images), similar to the error rate achieved with full gradient descent in a standard artificial neural network (Fig. 4.6b). Training the feedback weights was critical for enabling this level of performance on CIFAR-10, as fixed feedback weights led to much worse performance, even when the number of units was increased in order to match the total number of trainable parameters (see Tables B.3 and B.4), in line with previous results [100]. Furthermore, rich unit-specific feedback signals were critical: a network trained using a global reward signal plus activity correlations, while theoretically guaranteed to follow gradient descent on average [185, 28], was unable to achieve good performance on CIFAR-10 in a reasonable amount of time (Fig. 4.6b, node perturbation). For the ImageNet dataset, we observed a classification error rate of 56.1% on the top 5 predicted image classes with our algorithm, which is much better than the error rate achieved when keeping the feedback weights fixed, and much closer to that of full gradient descent (Fig. 4.6c).

Figure 4.5: Burst-dependent plasticity of recurrent and feedback connections promotes gradient-based learning by linearizing and aligning feedback. (a) Diagram of a hidden-layer unit in the rate model. Each unit (green outline) in the network represents an ensemble of pyramidal neurons. Recurrent inputs (purple arrows) from all ensembles in a layer provide homeostatic control of the dendritic potential. (b) Throughout learning, recurrent weights were updated in order to push the burst probabilities towards the linear regime (top). This led to an overall decrease in the magnitudes of burst probabilities, while continuing to support the positive and negative values necessary for credit assignment (bottom). (c) Alignment of feedback weights Y and feedforward weights W for three layers in a three-hidden-layer network trained on MNIST. Each hidden layer contained 500 units. Homeostatic recurrent inputs slightly reduce the angle between the two sets of weights, denoted W∠Y, while learning on the feedback weights dramatically improves weight alignment. Each datapoint is the angle between feedforward and feedback weights at the start of a training epoch. (d) Angle between our weight updates (δ) and those prescribed by the backpropagation algorithm ($\delta_{\text{BP}}$), for three layers in a three-hidden-layer network trained on MNIST. Recurrent inputs slightly improve the approximation to backpropagation, whereas learning on the feedback weights leads to a much closer correspondence. Each datapoint is the average angle between weight updates during a training epoch. In c and d, results are displayed as mean ± SD over n = 5 trials.

Figure 4.6: Ensemble-level burst-dependent plasticity supports learning in deep networks. (a) The deep networks consisted of an input layer, a series of convolutional layers, and a series of fully-connected layers. Layers were connected with sets of feedforward weights (blue arrows) and feedback weights (orange arrows). Fully-connected hidden layers contained recurrent connections (purple arrows). (b) Our learning rule, combined with learning of the feedback weights, was able to reach the performance of the backpropagation algorithm (backprop) on the CIFAR-10 classification task. (c) A network trained using our learning rule was able to learn to classify images in the ImageNet dataset when feedback weights were also updated. In b and c, results are displayed as mean ± SD over n = 5 trials.

The remaining gap between the ensemble-level burst-dependent learning rule and backprop performance on ImageNet can likely be explained by the fact that we could not use recurrent input at convolutional layers due to memory limitations, which led to degraded linearity of feedback at early layers (Fig. B.7). We also trained a network on the MNIST dataset, and achieved a similar performance of 1.1% error on the test set with all three algorithms (Fig. B.8). Therefore, these results demonstrate that the ensemble-level burst-dependent learning rule, coupled with additional mechanisms to promote multiplexing, linearity and alignment, can solve difficult tasks that other biologically plausible learning algorithms have struggled with.

4.5 Discussion

In this paper, we asked the following question: could high-frequency bursts in pyramidal neurons provide an instructive signal for synaptic plasticity that can coordinate learning across hierarchical circuits (Fig. 4.1)? We have shown that the well-known burst-dependence of plasticity combined with STP and regenerative dendritic activity turns feedback connections into a teacher (Fig. 4.2), which by multiplexing (Fig. 4.3) can coordinate plasticity across multiple synaptic jumps (Fig. 4.4). We then showed that, with some additional burst-dependent learning at recurrent and feedback synapses, these mechanisms provide an approximation of a loss-function gradient for supervised learning (Fig. 4.5) and perform well on challenging image classification tasks (Fig. 4.6). Together, these results show that a local, spike-based and experimentally supported learning rule that utilizes high-frequency bursts as an instructive signal can enable sophisticated credit assignment in hierarchical circuits.

Decades of research into biologically plausible learning have struggled to find a confluence of biological properties that permit efficient credit assignment. In this manuscript, we focused on the frequency-dependence of LTP/LTD, STP, dendritic nonlinearities, and inhibitory microcircuits. We focused on these aspects in part because the previous literature has established that these properties have important links with synaptic plasticity [68, 226, 137, 214], but also because they are very well-established properties of cortical circuits. Our burst-dependent learning rule itself could readily be implemented by previously established synaptic plasticity signalling pathways [217]. Overall, our model can be seen as a concrete implementation of a recent proposal from [187], which posited that differences in activity over time could carry gradient signals. Here, we have shown that differences in the probability of high-frequency bursts can carry gradient signals without affecting the time-dependent flow of sensory information. Therefore, one of the primary lessons from our model is that when local synaptic plasticity rules are sensitive to high-frequency bursts, pyramidal neurons possess the necessary machinery for backprop-like top-down control of synaptic plasticity.

It is important to note that there are a number of limitations to our model. First, our ensemble-level models utilized many “ensemble units” that incorporated the activity of many pyramidal neurons, which could potentially require networks of disproportionate size. However, the functional impact of using many neurons in an ensemble is to provide a means for averaging the burst probabilities. Theoretically, this averaging could be done over time, rather than over neurons. If so, there is no reason that the algorithm could not work with single-neuron ensembles, though it would require a much longer time to achieve good estimates of the gradients. To some extent, this is simply the typical issue faced by any model of rate-based coding: if rates are used to communicate information, then spatial or temporal averaging is required for high-fidelity communication. Furthermore, we suspect that allowing population coding could reduce the number of neurons required for a reliable representation [238].

Next, by focusing on learning, we ignored other ongoing cognitive processes.
For instance, the close link between attention and credit assignment implies that the same mechanisms may serve both attention and learning purposes [239, 240]. Although some experimental data point to a role for bursting in attention [241, 242], further work is required to establish whether burst coding can give rise to attention-like capabilities in neural networks.

The presence of the gating term, M(t), may be seen as an additional limitation of the model, since it is left in an abstract form and not directly motivated by biology. This term was introduced in order to ensure that learning was driven by the teaching signal and not by changes in the stimuli. Of course, if the goal is not supervised learning but unsupervised learning, then this term may be unnecessary. Indeed, one may view this as a prediction of sorts, i.e. that learning to match a target should involve additional gating mechanisms that are not required for unsupervised learning. These gating mechanisms could be implemented, e.g., by dendritic disinhibition [207, 226, 214, 227] (Fig. B.5b) or transient neuromodulation [229, 177, 230]. Our model did not include any sophisticated disinhibition mechanisms or neuromodulatory systems. Yet, we know both disinhibition and neuromodulation can regulate synaptic plasticity [177]. Future work could investigate how burst-dependent plasticity and disinhibition/neuromodulation could interact to guide supervised learning.

Another set of limitations derives from how we moved from detailed cellular-level simulations to abstract neural network models that were capable of solving complex tasks. For example, in moving to the abstract models, we gradually made a number of simplifying assumptions, including a clear separation between bursts and single spikes, simplified STP, simplified bursting mechanisms, and ensemble-level units that represented spiking activity across multiple neurons with a single value. We highlight these limitations because it may not be trivial to simulate the cellular-level plasticity rule within a network large enough to implement sophisticated credit assignment. Moreover, effective training requires a non-trivial amount of hyperparameter tuning. Ideally, we would have the computational resources or the neuromorphic hardware to fully simulate many thousands of ensembles of pyramidal neurons and interneurons with complex synaptic dynamics and bursting in order to see if the cellular-level burst-dependent rule could also solve complicated tasks. However, these questions will have to be resolved by large-scale projects that can simulate millions of biophysically realistic neurons with complicated internal dynamics and tune their hyperparameters.

Similarly, we did not include recurrent connections between pyramidal neurons within a layer, despite the fact that such connections are known to exist. We did this for the sake of simplicity, but again, we consider recurrent connectivity to be fully compatible with our model and a subject of future investigations. Moreover, our model makes some high-level assumptions about the structure of cortical circuitry. For example, we assumed that top-down signals are received at apical dendrites while bottom-up signals are received at basal dendrites. There is evidence for this structure [223], but also some data showing that it is not always this way [243]. Likewise, we assumed that pyramidal neurons across the cortical hierarchy project reciprocally with one another.
There is some evidence that the same cells that project backwards in the cortical hierarchy also project forwards [244], but the complete circuitry of the cortex is far from determined.

Our model makes a number of falsifiable predictions that could be examined experimentally. First, the model predicts that there should be a polarization of STP along the sensory hierarchy, with bottom-up synaptic projections being largely STD and top-down synaptic projections being largely STF. There are reports of such differences in thalamocortical projections [208, 209], which suggests that an important missing component of our model is the inclusion of thalamic circuitry. There are also reports of polarization of STP along the basal dendrosomatic axis [245], and our model would predict that this polarization should extend to apical dendrites. Second, because our model proposes that burst firing carries information about errors, there should be a relationship between burst firing and progress in learning. Specifically, our model predicts that the variance in burst probabilities across a population should be correlated with the errors made during learning (Fig. B.9). Experimental evidence in other systems supports this view [206, 214]. Third, our model predicts that the moving average of the number of times that a neuron emits a burst when it spikes (its “burst fraction”) should determine the threshold for LTP in pyramidal neurons. Thus, and consistent with the fact that synaptic weights have a finite range, our model predicts that if a neuron generally bursts whenever it spikes, it will be more difficult to induce LTP in that neuron, and vice versa. Finally, our model predicts that inhibition in the distal apical dendrites serves, in part, to homeostatically regulate burst probabilities to promote learning. Thus, a fairly simple prediction from the model is that manipulations of distal-dendrite-targeting interneurons, such as somatostatin-positive interneurons, should lead to unusual levels of bursting in cortical circuits and disrupt learning. Some recent experimental evidence supports this prediction [226, 214]. It is worth emphasizing here that our model was based largely on the physiology of layer 5 neocortical pyramidal neurons, and therefore these predictions are most applicable to this cell type. Whether this model and its predictions could also be applied to other cell types would depend on whether the computational principles we developed here could be modified to work with those cell types’ physiological properties.

Linking low-level and high-level computational models of learning is one of the major challenges in computational neuroscience. Our focus on supervised learning of static inputs was motivated by recent progress in this area. However, machine learning researchers have also been making rapid progress in unsupervised learning on temporal sequences in recent years [246, 247]. We suspect that many of the same mechanisms we explored here, e.g. burst-dependent plasticity, but also many of the mechanisms not explored here, e.g. plasticity induced by cooperative synaptic inputs producing dendritic spikes [248, 249], or bursting induced by feedforward activity escaping feedforward inhibition [234, 235], could be adapted for unsupervised learning of temporal sequences in hierarchical circuits.
It is likely that the brain combines unsupervised and supervised learning mechanisms, and future research should be directed towards how neurons may combine different rules for these purposes. Ultimately, by showing that a top-down orchestration of learning is a natural result of a small set of experimentally observed physiological phenomena, our work opens the door to future approaches that utilize the unique physiology of cortical microcircuits to implement powerful learning algorithms on dynamic stimuli.

4.6 Methods

4.6.1 Spiking model

Spiking simulations were performed using the Auryn simulator [250], except for the pairing protocols of Fig. 4.2b-d, which used Python.

Event and burst detection

An event was said to occur either at the time of an isolated spike or at the time of the first spike in a burst. A burst was defined as any occurrence of at least two spikes with an interspike interval (ISI) less than the threshold $b_{\text{th}} = 16$ ms [251, 115]. Any additional spike with ISI < $b_{\text{th}}$ belonged to the same burst. A neuron $i$ kept track of its time-averaged burst probability $\bar{P}_i$ by using exponential moving averages of its event train $E_i$ and burst train $B_i$:

1 Z ∞ −τ/τavg Ei(t) = Ei(t − τ)e dτ (4.3) τavg 0 1 Z ∞ −τ/τavg Bi(t) = Bi(t − τ)e dτ (4.4) τavg 0

Bi(t) P i(t) = , (4.5) Ei(t) P P where τavg is a slow time constant (∼ 1-10 s). Also, Ei(t) = event δ(t−ti,event) and Bi(t) = burst δ(t− ti,burst), where ti,event and ti,burst indicate the timing of an event and of the second spike in a burst, respectively.
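To make this bookkeeping concrete, here is a minimal Python sketch of the detection scheme and the moving averages of Eqs. 4.3-4.5. The function names, the toy spike train and the choice $\tau_{\text{avg}} = 5$ s are ours; the Auryn implementation differs in its details.

```python
import numpy as np

B_TH = 0.016     # burst ISI threshold b_th (s)
TAU_AVG = 5.0    # slow averaging time constant tau_avg (s)

def detect_events_and_bursts(spike_times):
    """Split sorted spike times into event times and burst times.

    An event is an isolated spike or the first spike of a burst; the burst
    itself is registered at its second spike, as in the definition of B_i."""
    events, bursts, in_burst = [], [], False
    for i, t in enumerate(spike_times):
        isi = t - spike_times[i - 1] if i > 0 else np.inf
        if isi < B_TH:
            if not in_burst:
                bursts.append(t)       # second spike of a new burst
                in_burst = True
            # further spikes with ISI < b_th extend the same burst
        else:
            events.append(t)           # isolated spike or first spike of a burst
            in_burst = False
    return np.array(events), np.array(bursts)

def exp_moving_average(times, t_grid, tau=TAU_AVG):
    """(1/tau) * sum over past times t_k of exp(-(t - t_k)/tau), as in Eqs. 4.3-4.4."""
    return np.array([np.exp(-(t - times[times <= t]) / tau).sum() / tau
                     for t in t_grid])

rng = np.random.default_rng(0)
spikes = np.sort(rng.uniform(0.0, 20.0, 200))      # toy ~10 Hz spike train
events, bursts = detect_events_and_bursts(spikes)
t_grid = np.linspace(1.0, 20.0, 50)
E_bar = exp_moving_average(events, t_grid)
B_bar = exp_moving_average(bursts, t_grid)
P_bar = B_bar / np.maximum(E_bar, 1e-9)            # burst probability, Eq. 4.5
print(f"mean event rate {E_bar.mean():.2f} Hz, mean burst probability {P_bar.mean():.2f}")
```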

Plasticity rule

Weights were updated upon the detection of a postsynaptic event or burst according to

$$\frac{dw_{ij}}{dt} = \eta M \left\{ \left[ (1 + H_i) B_i + (-\bar{P}_i + H_i) E_i \right] \tilde{E}_j + G_i E_j \right\}, \qquad (4.6)$$

where $\tilde{E}_j(t) = \int_0^\infty E_j(t - \tau)\, e^{-\tau/\tau_{\text{pre}}}\, d\tau$ is a presynaptic trace with time constant $\tau_{\text{pre}}$. Here, $\tau_{\text{pre}}$ is typically much smaller than $\tau_{\text{avg}}$, with $\tau_{\text{pre}} \sim 10$ ms, but it could possibly be made larger to accommodate plasticity rules with slower dynamics [137]. The prefactor $M$ gates plasticity during training: in the XOR task (Fig. 4.4), $M = 1$ when the teaching signal is present and 0 otherwise. In Fig. 4.2, $M = 1$ throughout. Homeostatic terms help to restrict the activity of neurons to an appropriate range. The homeostatic functions $H_i$ and $G_i$ were defined as

$$H_i(t) = \left( \frac{e_{\max}}{\bar{E}_i(t)} - 1 \right) \Theta\!\left( \bar{E}_i(t) - e_{\max} \right) \qquad (4.7)$$

$$G_i(t) = \left( e_{\min} - \bar{E}_i(t) \right) \Theta\!\left( e_{\min} - \bar{E}_i(t) \right), \qquad (4.8)$$

where $e_{\min}$ (resp. $e_{\max}$) is a minimum (resp. maximum) event rate, and $\Theta(\cdot)$ denotes the Heaviside step function. When the neuron-specific running average of the event rate, $\bar{E}_i(t)$, lies within these limits, $H_i = G_i = 0$, and we recover the learning rule of Eq. 4.1. In most simulations, network parameters were chosen in such a way that the homeostatic plasticity had little to no effect. Typically, we used $e_{\min} = 2$ Hz and $e_{\max} = 10$ Hz.
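The following is a discrete-time sketch of this rule with the homeostatic terms included. The per-step 0/1 event and burst indicators, the toy rates and all variable names are ours; note that once the burst probability estimate matches the actual burst fraction, weight changes average out, as expected from Eq. 4.1.

```python
import numpy as np

DT = 0.001                   # time step (s)
TAU_PRE, TAU_AVG = 0.010, 5.0
E_MIN, E_MAX = 2.0, 10.0     # homeostatic bounds e_min, e_max (Hz)

def plasticity_step(w, eta, M, E_post, B_post, E_pre, s):
    """One Euler step of Eq. 4.6; E_post, B_post, E_pre are 0/1 per-step indicators."""
    # Slow moving averages of the postsynaptic event and burst trains (Eqs. 4.3-4.4).
    s["E_bar"] += DT / TAU_AVG * (E_post / DT - s["E_bar"])
    s["B_bar"] += DT / TAU_AVG * (B_post / DT - s["B_bar"])
    P_bar = s["B_bar"] / max(s["E_bar"], 1e-9)
    # Presynaptic trace E~_j with fast time constant tau_pre.
    s["E_tr"] += -DT / TAU_PRE * s["E_tr"] + E_pre
    # Homeostatic terms (Eqs. 4.7-4.8): zero while E_bar stays in [E_MIN, E_MAX].
    H = (E_MAX / s["E_bar"] - 1.0) if s["E_bar"] > E_MAX else 0.0
    G = (E_MIN - s["E_bar"]) if s["E_bar"] < E_MIN else 0.0
    dw = eta * M * (((1.0 + H) * B_post + (-P_bar + H) * E_post) * s["E_tr"]
                    + G * E_pre)
    return w + dw

rng = np.random.default_rng(1)
s = {"E_bar": 5.0, "B_bar": 1.5, "E_tr": 0.0}
w = 0.5
for _ in range(20000):
    E_pre = float(rng.random() < 5.0 * DT)           # ~5 Hz presynaptic events
    E_post = float(rng.random() < 8.0 * DT)          # ~8 Hz postsynaptic events
    B_post = E_post * float(rng.random() < 0.3)      # ~30% of events become bursts
    w = plasticity_step(w, 0.1, 1.0, E_post, B_post, E_pre, s)
print(f"final weight: {w:.4f} (drift is small once P_bar matches the burst fraction)")
```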

Pairing protocols

For all pairing protocols of Fig. 4.2b-d, we had τpre = 50 ms, τavg = 15 s, η = 0.1, and we set the homeostatic terms to zero.

• Periodic protocol. Groups of five consecutive pairings were delivered 15 times, with a quiescent period of 10 s between groups. We used pairings with $\Delta t = 0$. For each pairing frequency, the starting value for the estimated burst probability was $\bar{P}_i(t = 0) = 0.15$, with $\bar{E}_i(t = 0) = 5$ Hz.

• Poisson protocol. Both the pre- and postsynaptic neurons fired spikes at a Poisson rate $r$ with a refractory period of 2 ms. For each $r$, the induction lasted 100 s and we averaged over 20 independent realizations. We used $\bar{E}_i(t = 0) = 5$ Hz.

• Burst-Poisson protocol. Both the pre- and postsynaptic neurons produced events at a Poisson rate $r$, including a refractory period $\tau_{\text{ref}}^{(E)} > b_{th}$. For each event, a burst was generated with probability $p^{(B)}$ and an intraburst ISI was sampled from $\text{Unif}(\tau_{\text{ref}}^{(B)}, \tau_{\text{ref}}^{(B)} + t_{\max})$, with $\tau_{\text{ref}}^{(B)} + t_{\max} < b_{th}$. For the simulations in Fig. 4.2d, we used $\tau_{\text{ref}}^{(E)} = 20$ ms, $\tau_{\text{ref}}^{(B)} = 2$ ms and $t_{\max} = 10$ ms. We set $\bar{P}_i(t = 0) = 0.2$ and the event rates of the pre- and postsynaptic neurons were set to $r = 5$ Hz and $r = 10$ Hz, with corresponding values for the initial postsynaptic event rate estimates. For each $r$, the induction lasted 100 s and we averaged over 20 independent realizations.

Neuron models

• Pyramidal neurons. The somatic compartment obeyed

$$C_s \dot{V}_s = -(C_s/\tau_s)(V_s - E_L) + g_s f(V_d) + I_s - w_s \qquad (4.9)$$

$$\tau_{w_s} \dot{w}_s = -w_s + b\, \tau_{w_s} S(t),$$

where $V_s$ is the somatic membrane potential, $w_s$ is an adaptation variable, $I_s$ is the total current applied to the soma (includes noise and other synaptic inputs) and $S(t)$ is the spike train of the neuron. The function $f(V_d)$ in the equation for $V_s$ takes into account the coupling with the dendritic compartment, with $f(V_d) = 1/\{1 + \exp[-(V_d - E_d)/D_d]\}$ and parameters $E_d = -38$ mV and $D_d = 6$ mV. A spike occurred whenever $V_s$ crossed a moving threshold from below. The latter jumped up by 2 mV right after a spike and relaxed towards $-50$ mV with a time constant of 27 ms. Other somatic parameters were: $\tau_s = 16$ ms, $C_s = 370$ pF, $E_L = -70$ mV, $\tau_{w_s} = 100$ ms, $b = 200$ pA, and $g_s = 1300$ pA. The reset voltage after a spike was $V_r = -70$ mV. The dendritic compartment obeyed

$$C_d \dot{V}_d = -(C_d/\tau_d)(V_d - E_L) + g_d f(V_d) + c_d (K \ast S)(t) + I_d - w_d \qquad (4.10)$$

$$\tau_{w_d} \dot{w}_d = -w_d + a(V_d - E_L).$$

The function $f(V_d)$ is the same as above and is responsible for the regenerative dendritic activity. The term $c_d (K \ast S)(t)$ represents the backpropagating action potential, with the kernel $K$ modeled as a box filter of amplitude one and duration 2 ms, delayed by 0.5 ms with respect to the somatic spike. Other dendritic parameters were: $\tau_d = 7$ ms, $C_d = 170$ pF, $E_L = -70$ mV, $\tau_{w_d} = 30$ ms, $a = 13$ nS, and $g_d = 1200$ pA. This model and its parameters are described in more detail and compared with experimental data in Ref. [115]. (An integration sketch of this model and of the interneuron models below is given after this list.)

• Dendrite-targeting inhibition. We modeled somatostatin-positive interneurons [252, 253, 140] using the adaptive exponential integrate-and-fire (AdEx) model [254]:

$$C \dot{V} = -g_L (V - E_L) + g_L \Delta_T\, e^{(V - V_T)/\Delta_T} + I - w$$

$$\tau_w \dot{w} = a(V - E_L) + b\, \tau_w S(t) - w,$$

where $I$ is the total current applied to the neuron. A spike occurred whenever $V$ crossed $V_{\text{cut}} = 24$ mV and was followed by a refractory period $\tau_{\text{ref}}$. Parameter values were $C = 100$ pF, $g_L = 5$ nS, $E_L = -70$ mV, $V_T = -62$ mV, $\Delta_T = 4$ mV, $\tau_w = 500$ ms, $a = 0.5$ nS, $b = 10$ pA, $V_r = -65$ mV and $\tau_{\text{ref}} = 2$ ms. In Fig. 4.3, these model neurons (grey squares in Fig. 4.3a) received top-down excitation from higher-level pyramidal cells.

• Perisomatic inhibition. We modeled parvalbumin-positive neurons [255] using the AdEx model with parameters chosen to reproduce qualitatively their typical fast-spiking phenotype. Parameter values were $C = 100$ pF, $g_L = 10$ nS, $E_L = -70$ mV, $V_T = -48$ mV, $\Delta_T = 2$ mV, $V_r = -55$ mV, $\tau_{\text{ref}} = 1$ ms and $a = b = 0$. In Fig. 4.3, these model neurons (grey discs in Fig. 4.3a) received bottom-up excitation from the lower-level pyramidal cells.
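Below is a minimal Euler-integration sketch of the two-compartment pyramidal model (Eqs. 4.9-4.10) and of the AdEx interneurons, using the parameter values listed above. The time step, the injected currents and the backpropagating-AP amplitude $c_d$ (not specified above) are illustrative choices of ours; synaptic conductances and noise are omitted, and the PV adaptation time constant is irrelevant here since $a = b = 0$.

```python
import numpy as np

DT = 1e-4  # time step (s)

def f_dend(vd, ed=-38e-3, dd=6e-3):
    """Sigmoidal dendro-somatic coupling f(V_d)."""
    return 1.0 / (1.0 + np.exp(-(vd - ed) / dd))

def simulate_pyramidal(T=0.5, I_s=600e-12, I_d=0.0, c_d=2600e-12):
    """Two-compartment pyramidal neuron, Eqs. 4.9-4.10 (c_d is our placeholder)."""
    tau_s, C_s, E_L, tau_ws, b, g_s = 16e-3, 370e-12, -70e-3, 100e-3, 200e-12, 1300e-12
    tau_d, C_d, tau_wd, a, g_d = 7e-3, 170e-12, 30e-3, 13e-9, 1200e-12
    n = int(T / DT)
    vs, vd, ws, wd, thr = E_L, E_L, 0.0, 0.0, -50e-3
    bap = np.zeros(n + 100)       # backpropagating-AP kernel: 2 ms box, 0.5 ms delay
    spikes = []
    for i in range(n):
        vs += DT / C_s * (-(C_s / tau_s) * (vs - E_L) + g_s * f_dend(vd) + I_s - ws)
        vd += DT / C_d * (-(C_d / tau_d) * (vd - E_L) + g_d * f_dend(vd)
                          + c_d * bap[i] + I_d - wd)
        ws += DT / tau_ws * (-ws)                     # tau_ws dws/dt = -ws + b tau_ws S(t)
        wd += DT / tau_wd * (-wd + a * (vd - E_L))    # tau_wd dwd/dt = -wd + a (Vd - E_L)
        thr += DT / 27e-3 * (-50e-3 - thr)            # moving threshold relaxes to -50 mV
        if vs > thr:
            spikes.append(i * DT)
            vs, thr = -70e-3, thr + 2e-3              # reset; threshold jumps 2 mV
            ws += b                                   # spike-triggered adaptation
            j = i + int(0.5e-3 / DT)
            bap[j:j + int(2e-3 / DT)] = 1.0
    return spikes

def simulate_adex(C, gL, EL, VT, DeltaT, tau_w, a, b, Vr, t_ref, I=300e-12, T=1.0):
    """AdEx neuron used for the SOM-like and PV-like interneurons."""
    v, w, ref, spikes = EL, 0.0, 0.0, 0
    for _ in range(int(T / DT)):
        if ref > 0:
            ref -= DT
            continue
        v += DT / C * (-gL * (v - EL) + gL * DeltaT * np.exp((v - VT) / DeltaT) + I - w)
        w += DT / tau_w * (a * (v - EL) - w)
        if v >= 24e-3:                                # V_cut
            v, ref, w, spikes = Vr, t_ref, w + b, spikes + 1
    return spikes / T

print(f"pyramidal: {len(simulate_pyramidal())} spikes in 0.5 s")
print(f"SOM-like rate: {simulate_adex(100e-12, 5e-9, -70e-3, -62e-3, 4e-3, 500e-3, 0.5e-9, 10e-12, -65e-3, 2e-3):.1f} Hz")
print(f"PV-like rate:  {simulate_adex(100e-12, 10e-9, -70e-3, -48e-3, 2e-3, 500e-3, 0.0, 0.0, -55e-3, 1e-3):.1f} Hz")
```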

Connectivity

In general, connections between distinct neural ensembles were sparse (∼5-20% connection probability). Pyramidal neurons within an ensemble had no recurrent connections between their somatic compartments. Within a pyramidal ensemble, burst-probability linearization was enacted by sparse STF inhibitory synapses onto the dendritic compartments (Fig. B.10). These STF connections were not illustrated in Fig. 4.3a for clarity. The net strength of inputs onto the apical dendrites was chosen to preserve a stationary burst probability between 10 and 50%, as in vivo experimental data report burst probabilities between 15 and 25% [215, 242].

Synapses

All synapses were conductance-based. The excitatory (resp. inhibitory) reversal potential was 0 mV (resp. −80 mV) and the exponential decay time constant was 5 ms (resp. 10 ms). There were no NMDA components to excitatory synapses. For a given connection between two ensembles, existing synapses had their strengths all initialized to the same value.

Noise

Each neuron (for single-compartment neurons) and each compartment (for two-compartment neurons) received its own (private) noise in the form of a high-frequency excitatory Poisson input combined with an inhibitory Poisson input. The only exception was the noise applied to the neural populations in Fig. 4.2e-g, where we used sparse connections from a pool of excitatory and inhibitory Poisson neurons. Noise served to decorrelate neurons within a population and to imitate in vivo conditions.

Short-term plasticity

STP was modeled following the extended Markram-Tsodyks model [195]. Using the notation of Ref. [256], the parameters for STF were D = 100 ms, F = 100 ms, U = 0.02 and f = 0.1. For STD, the parameters were D = 20 ms, F = 1 s, U = 0.9 and f = 0.1. These sets of parameters were chosen following [115] to help decode bursts (using STF) and events (using STD).
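As a concrete illustration, the short sketch below evaluates the relative synaptic efficacy of these two parameter sets on a 50 Hz presynaptic train. Exact update-ordering conventions for the Markram-Tsodyks model vary; this follows the common facilitation-before-release convention, and the function name and example train are ours.

```python
import math

def stp_efficacies(spike_times, D, F, U, f):
    """Relative synaptic efficacy u*x at each presynaptic spike
    (extended Markram-Tsodyks model; D, F, U, f as in the text)."""
    x, u = 1.0, U          # x: available resources, u: release probability
    last_t, out = None, []
    for t in spike_times:
        if last_t is not None:
            dt = t - last_t
            x = 1.0 - (1.0 - x) * math.exp(-dt / D)  # resources recover (tau = D)
            u = U + (u - U) * math.exp(-dt / F)      # facilitation decays (tau = F)
        u = u + f * (1.0 - u)                        # spike-triggered facilitation
        out.append(u * x)                            # release fraction for this spike
        x = x - u * x                                # vesicle depletion
        last_t = t
    return out

train = [0.0, 0.02, 0.04, 0.06, 0.08]                # 50 Hz presynaptic train
print("STF:", [round(e, 3) for e in stp_efficacies(train, D=0.1, F=0.1, U=0.02, f=0.1)])
print("STD:", [round(e, 3) for e in stp_efficacies(train, D=0.02, F=1.0, U=0.9, f=0.1)])
```

The STF set starts with a low release probability and grows across the train (well suited to decoding bursts), while the STD set starts high and depresses (well suited to decoding events).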

Spiking XOR gate

An XOR gate maps binary inputs (0, 0) and (1, 1) onto 0 and inputs (1, 0) and (0, 1) onto 1. In the context of our spiking network, input 0 corresponded to a low event rate (∼2 Hz) and input 1 to a higher event rate (∼10 Hz). These were obtained by applying a hyperpolarizing (resp. depolarizing) current for 0 (resp. 1) to the corresponding input-layer population. Importantly, compared to the spiking simulations described above, our implementation of the spiking XOR gate used three simplifications to reduce the dimension of the parameter search space. First, events and bursts were propagated directly instead of relying on STP (see Fig. B.11). Second, disynaptic inhibition was replaced by direct inhibition coming from the pyramidal cells. Third, we used a simplified pyramidal neuron model. Below, we describe this model, as well as the initialization of the network, the error generation and the learning protocol for the XOR gate.

• Simplified pyramidal neuron model. The effect of dendritic regenerative activity on the somatic compartment (controlled by $g_s$ in Eqs. 4.9-4.10) was replaced by a conditional burst probability: whenever a somatic event occurred, a burst was produced with probability $f(V_d)$. This function is the same as that appearing in Eqs. 4.9-4.10, but with $E_d = -57$ mV. This model permitted a cleaner burst-detection process and burst-ensemble multiplexing.

• Initialization of the network. The feedforward synaptic strengths were initialized so that the event rates of all pyramidal ensembles in the network belonged to $[e_{\min}, e_{\max}]$ for all inputs. All existing feedforward excitatory synaptic strengths were equal, and likewise for the inhibitory synapses. The feedback synaptic strengths from the output population to the hidden populations (the only existing ones) were initialized so that one coarse-grained connection would be predominantly excitatory and the other inhibitory (the one onto hidden pool 2 in Fig. 4.4). As with the feedforward connections, the excitatory feedback synapses belonging to the same coarse-grained connection shared the same strength, and likewise for inhibition. A constant depolarizing current was applied to the hidden pool 2's dendritic compartments to compensate for the stronger inhibition.

• Error generation. At the output layer, we specified a maximum and a minimum event rate, $e_{\max}$ and $e_{\min}$ (the same as in the learning rule of Eq. 4.6). The linearly transformed event rate

$$\bar{E}_i' = \frac{\bar{E}_i - e_{\min}}{e_{\max} - e_{\min}}$$

was then used in conjunction with a cross-entropy loss function to compute the error for each neuron of the output population. As a result, a current, $I_i^{(d)}$ (where $d$ indicates "dendritic"), was injected into every neuron so that its burst probability would increase or decrease according to the running average of its event rate and the desired output:

$$\text{if desired output} = 0 \;\Rightarrow\; I_i^{(d)} = -c/(e_{\max} - \bar{E}_i)$$
$$\text{if desired output} = 1 \;\Rightarrow\; I_i^{(d)} = c/(\bar{E}_i - e_{\min}),$$

where $c \sim 1$ nA $\cdot$ Hz. For instance, if the desired output was 0 and $\bar{E}_i$ was large, then the injected current was strongly hyperpolarizing. The injected current was set to zero when $\bar{E}_i$ was within 1 Hz of the desired value (see the short sketch of this teacher current after this list).

• Learning protocol. A simulation proceeded as follows. With the plasticity off, there was first a relaxation interval of duration $3\tau_{\text{avg}}$, with no input applied to the network. In Fig. 4.4, we set $\tau_{\text{avg}} = 2$ s, although a faster time scale can still yield adequate learning (Fig. B.2). Then, the four different input pairs were applied consecutively to give the "before learning" response in Fig. 4.4d. Afterward, the four input/output pairs were applied consecutively (for one epoch), typically in the same order (but see Fig. B.1e). For each input/output pair, first, the input alone was applied to the input populations with the plasticity off. We let the network reach its steady state for that input for the first 90% of the duration of an example. During this prediction interval, the moving average of the burst probability would converge towards the actual burst probability of the population for that given input. The duration of an example was chosen to be $4\tau_{\text{avg}} = 8$ s to provide enough time for this steady state to be reached to a good approximation, although relaxing that assumption can still produce adequate learning (Fig. B.2). During the last 10% of the example duration, the plasticity was activated for all feedforward excitatory synapses and the teacher was applied. For computational efficiency, the error was computed once, at the very end of the prediction interval. The total number of epochs required to reach decent performance depended on the initialization of the network and the learning rate; for Fig. 4.4, we used 500 epochs. At the end of learning, the plasticity was switched off for good and the "after learning" response was computed.

4.6.2 Deep network model for categorical learning

We now describe the deep network model that was used to learn the classification tasks reported in Figs. 4.5-4.6. The model can be seen as a limiting case of a time-dependent rate model, which itself can be heuristically derived from the spiking network model under simplifying assumptions (see Appendix B). For the fully-connected layers in the network, we defined the “somatic potentials” of units in layer l as:

$$\mathbf{v}_l = W_l \mathbf{e}_{l-1},$$

where $W_l$ is the weight matrix connecting layer $l - 1$ to layer $l$. Note that in this formulation we include a bias term as a column of $W_l$. The event rate of layer $l$ was given by

$$\mathbf{e}_l = f_l(\mathbf{v}_l),$$

where fl is the activation function for layer l. In models trained on MNIST and CIFAR-10, the activation function was a sigmoid. In the model trained on ImageNet, a ReLU activation was used for hidden layers and a softmax activation was used at the output layer.

During the feedforward pass, the burst probability at the output layer ($l = L$) was set to a constant, $p_L^{(0)}$ (in these experiments, this was set to 0.2). Our previous research [115] has shown that the dendritic transfer function is a sigmoidal function of its input (see also Fig. B.5). Therefore, the hidden-layer burst probabilities, $\mathbf{p}_l$, for $l < L$, were computed using a sigmoidal function of a local "dendritic potential" $\mathbf{u}_l$ as

$$\mathbf{p}_l = \sigma(\beta \mathbf{u}_l + \alpha),$$

where $\alpha$ and $\beta$ are constants controlling the dendritic transfer function. In our experiments, we set $\beta = 1$ and $\alpha = 0$. Figure B.5 illustrates various mechanisms affecting these parameters. The dendritic potentials were given by

$$\mathbf{u}_l = \mathbf{h}(\mathbf{e}_l) \odot \left( Y_l \mathbf{b}_{l+1} \right), \qquad (4.11)$$

where $\odot$ is the elementwise product. The vector-valued function $\mathbf{h}(\mathbf{e}_l) \equiv f_l'(\mathbf{v}_l) \odot \mathbf{e}_l^{-1}$ depends on the chosen activation function; of course, some caution is required when ReLU and softmax activations are used (see Appendix B). The burst rate is given by

$$\mathbf{b}_{l+1} = \mathbf{p}_{l+1} \odot \mathbf{e}_{l+1}. \qquad (4.12)$$

Finally, $Y_l$ is the feedback weight matrix. For the feedback alignment algorithm, $Y_l$ is a random matrix and is fixed throughout learning [46]. In the standard backpropagation algorithm, the feedforward and feedback weight matrices are symmetric so that $Y_l = W_{l+1}^T$, where $^T$ denotes the transpose. Below, we also describe how to learn the feedback weights to make them symmetric with the feedforward weights using the Kolen-Pollack algorithm [42].

With the teacher present, the output-layer burst probabilities were set to a squashed version of $\mathbf{p}_L^{(0)} - \mathbf{h}(\mathbf{e}_L) \odot \nabla_{\mathbf{e}_L} \mathcal{L}$, where $\mathcal{L}$ is the loss function (a mean squared error loss for Figs. 4.5-4.6). The squashing function ensured that $p_{L,i} \in [0, 1]$. Appendix B provides a few examples of squashing functions. The burst probabilities of the hidden layers were then computed as above. Finally, the weights were updated according to

$$\Delta W_l = \eta_l \left[ (\mathbf{p}_l - \bar{\mathbf{p}}_l) \odot \mathbf{e}_l \right] \mathbf{e}_{l-1}^T - \lambda W_l, \qquad (4.13)$$

where $\mathbf{p}_l$ and $\bar{\mathbf{p}}_l$ denote the burst probabilities with and without the teacher, respectively, $\eta_l$ is the learning rate hyperparameter for units in layer $l$, and $\lambda$ is a weight decay hyperparameter. Note that, for this model, $\bar{\mathbf{e}}_l$ lags $\mathbf{e}_l$ by a single computational step (see Appendix B). Therefore, when the teacher appears, $\bar{\mathbf{e}}_l = \mathbf{e}_l$ and we can write

$$(\mathbf{p}_l - \bar{\mathbf{p}}_l) \odot \mathbf{e}_l = \mathbf{b}_l - \bar{\mathbf{b}}_l.$$

This means that, in this model, the error is directly represented by the deviation of the burst rate with respect to a reference.
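To make the fully-connected model concrete, here is a minimal NumPy sketch of one burstprop update (Eqs. 4.11-4.13) on random data, with sigmoid activations, $\beta = 1$, $\alpha = 0$ and $p_L^{(0)} = 0.2$ as above. The layer sizes, weight scales, learning rates and the particular squashing function (a simple clip to $[0, 1]$) are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [10, 30, 20, 5]                       # input, two hidden layers, output
W = [rng.normal(0, 0.3, (n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
Y = [rng.normal(0, 0.3, (m, n)) for m, n in zip(sizes[1:-1], sizes[2:])]   # feedback
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
P0, BETA, ALPHA, ETA, LAMBDA = 0.2, 1.0, 0.0, 0.1, 1e-4
x, target = rng.random(sizes[0]), rng.random(sizes[-1])

# Feedforward pass: event rates e_l = f_l(W_l e_{l-1}) (sigmoid here).
e = [x]
for Wl in W:
    e.append(sigmoid(Wl @ e[-1]))
# h(e_l) = f'(v_l) / e_l elementwise, which reduces to (1 - e_l) for a sigmoid.
h = [el * (1.0 - el) / np.maximum(el, 1e-9) for el in e[1:]]

def feedback_pass(pL):
    """Burst probabilities p_l and burst rates b_l given output-layer p_L."""
    p, b = [None] * len(h), [None] * len(h)
    p[-1], b[-1] = pL, pL * e[-1]                      # b_l = p_l * e_l (Eq. 4.12)
    for l in range(len(Y) - 1, -1, -1):
        u = h[l] * (Y[l] @ b[l + 1])                   # dendritic potential (Eq. 4.11)
        p[l] = sigmoid(BETA * u + ALPHA)
        b[l] = p[l] * e[l + 1]
    return p, b

p_ref, b_ref = feedback_pass(np.full(sizes[-1], P0))   # teacher-absent (reference) pass
grad = e[-1] - target                                  # grad of MSE loss w.r.t. e_L
pL = np.clip(P0 - h[-1] * grad, 0.0, 1.0)              # "squashed" output burst probability
p, b = feedback_pass(pL)                               # teacher-present pass

# Weight update (Eq. 4.13): the error is the burst-rate deviation b_l - b_ref_l.
for l in range(len(W)):
    W[l] += ETA * np.outer(b[l] - b_ref[l], e[l]) - LAMBDA * W[l]

e_new = x
for Wl in W:
    e_new = sigmoid(Wl @ e_new)
print(f"loss before: {0.5 * np.sum(grad ** 2):.4f}, after one step: "
      f"{0.5 * np.sum((e_new - target) ** 2):.4f}")
```

Note that the teaching signal only enters through the burst probabilities; the event rates, which carry the feedforward stream, are untouched.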

In the case of convolutional layers, the event rates of ensembles in layer $l$ were given by

$$\mathbf{e}_l = f_l(W_l \ast \mathbf{e}_{l-1}),$$

where $\ast$ represents convolution. Similarly, the dendritic potentials in layer $l$ were given by $\mathbf{u}_l = Y_l \ast \mathbf{b}_{l+1}$ while burst probabilities were calculated as in the fully-connected layers. Finally, the weights of convolutional layers were updated as

$$\Delta W_l = \eta_l\, \psi\!\left( \mathbf{b}_l - \bar{\mathbf{b}}_l,\ \mathbf{e}_{l-1} \right) - \lambda W_l, \qquad (4.14)$$

where $\psi$ combines the burst deviations and event rates to compute an approximation of the gradient with respect to the convolutional weights $W_l$.

Learning the recurrent weights

In certain experiments, we introduced recurrent inputs into the hidden layers that served to keep burst probabilities in the linear regime of the sigmoid function. At layer $l$, we set the reference dendritic potentials to

$$\bar{\mathbf{u}}_l = \mathbf{h}(\mathbf{e}_l) \odot \left( Y_l \bar{\mathbf{b}}_{l+1} - Z_l \mathbf{b}_l^{(0)} \right), \qquad (4.15)$$

where $Z_l$ is the recurrent weight matrix and $\mathbf{b}_l^{(0)}$ denotes the burst rate calculated without any recurrent inputs and without the teaching signal:

$$\mathbf{b}_l^{(0)} = \sigma\!\left( \beta\, \mathbf{h}(\mathbf{e}_l) \odot (Y_l \bar{\mathbf{b}}_{l+1}) + \alpha \right) \odot \mathbf{e}_l. \qquad (4.16)$$

Otherwise, the dendritic potentials and burst rates must be solved self-consistently, slowing down computations. Recurrent weights are then updated in order to minimize $\bar{\mathbf{u}}_l$:

$$\Delta Z_l = -\eta_r\, \bar{\mathbf{u}}_l \left( \mathbf{b}_l^{(0)} \right)^T, \qquad (4.17)$$

where $\eta_r$ is the learning rate. Note that, with these recurrent inputs, the updates of matrix $W_l$ are the same as before, but now with

$$\mathbf{p}_l = \sigma\!\left[ \beta\, \mathbf{h}(\mathbf{e}_l) \odot \left( Y_l \mathbf{b}_{l+1} - Z_l \mathbf{b}_l^{(0)} \right) + \alpha \right].$$
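The toy sketch below illustrates the homeostatic role of the recurrent weights: for a fixed feedback drive, $Z_l$ learns until the reference dendritic potential is cancelled, keeping burst probabilities near $\sigma(\alpha)$. The quantities are random stand-ins, and we write the update with the sign that descends $\frac{1}{2}\|\bar{\mathbf{u}}_l\|^2$ under the sign convention of Eq. 4.15.

```python
import numpy as np

rng = np.random.default_rng(2)
n_l = 20
h_e = rng.random(n_l) + 0.1            # stand-in for h(e_l) (positive)
Yb = rng.normal(0.0, 1.0, n_l)         # stand-in for Y_l @ b_{l+1} (teacher absent)
e_l = rng.random(n_l)
Z = np.zeros((n_l, n_l))
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

b0 = sigmoid(h_e * Yb) * e_l                 # Eq. 4.16: bursts w/o recurrence, w/o teacher
for _ in range(500):
    u_bar = h_e * (Yb - Z @ b0)              # Eq. 4.15: reference dendritic potential
    Z += 0.05 * np.outer(u_bar, b0)          # shrink u_bar toward zero (cf. Eq. 4.17)
print(f"mean |u_bar| after learning: {np.abs(h_e * (Yb - Z @ b0)).mean():.5f}")
```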

Learning the feedback weights

Kolen and Pollack [106] found that if the feedforward and feedback weights are updated such that

$$\Delta W_l = \eta A - \lambda W_l$$

$$\Delta Y_l = \eta A - \lambda Y_l,$$

where $A$ is any matrix with the same shape as $W_l$ and $Y_l$, then $Y_l$ and $W_l$ will converge. This means that if the feedback weights are updated in the same direction as the feedforward weights and weight decay is applied to both sets of weights, they will eventually become symmetric. Thus, we implemented the following learning rule for the feedback weights between layer $l + 1$ and layer $l$:

$$\Delta Y_l = \eta_l\, \mathbf{e}_l \left( \mathbf{b}_{l+1} - \bar{\mathbf{b}}_{l+1} \right)^T - \lambda Y_l, \qquad (4.18)$$

where $\lambda$ is a weight decay hyperparameter. In convolutional layers, we used the following weight update:

$$\Delta Y_l = \eta_l\, \psi\!\left( \mathbf{b}_{l+1} - \bar{\mathbf{b}}_{l+1},\ \mathbf{e}_l \right) - \lambda Y_l. \qquad (4.19)$$
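A toy numerical check of the Kolen-Pollack argument follows: as long as both weight matrices receive the same update direction (here a fresh random matrix $A$ each step, standing in for the burst-derived update) together with weight decay, the difference $W - Y^T$ decays away. Shapes, rates and the random updates are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(5, 8))        # feedforward weights
Y = rng.normal(size=(8, 5))        # feedback weights; symmetry means Y = W^T
eta, lam = 0.1, 0.05

for step in range(201):
    if step % 50 == 0:
        print(f"step {step:3d}: ||W - Y^T||_F = {np.linalg.norm(W - Y.T):.5f}")
    A = rng.normal(size=W.shape)   # shared update direction (burst-derived in the model)
    W += eta * A - lam * W         # Delta W = eta A - lambda W
    Y += eta * A.T - lam * Y       # Delta Y receives the same update (transposed shape)
```

The mismatch shrinks by a factor $(1 - \lambda)$ per step regardless of what $A$ is, which is why the burst-deviation update of Eq. 4.18 suffices to symmetrize the weights.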

Training the model with CIFAR-10 and ImageNet

The network architectures described in Tables B.3 and B.4 of Appendix B were trained on standard image classification datasets, CIFAR-10 [257] and ImageNet [258]. The CIFAR-10 dataset consists of 60,000 32 × 32 px images belonging to 10 classes, while the ImageNet dataset consists of 1.2 million images (resized to 224 × 224 px) split among 1000 classes. Each unit in these networks represents an ensemble of pyramidal neurons and has an event rate, burst probability, and burst rate. For each training example, the input image is presented and a forward pass is done, where event rates $\mathbf{e}_l$ throughout the network are computed sequentially, followed by a feedback pass where burst probabilities $\bar{\mathbf{p}}_l$ and burst rates $\bar{\mathbf{b}}_l$ are computed. Then, the teaching signal is shown at the output layer, and new burst probabilities $\mathbf{p}_l$ and burst rates $\mathbf{b}_l$ are computed backward through the network. Weights are then updated using our weight update rules. Networks were trained using stochastic gradient descent (SGD) with mini-batches, momentum and weight decay. ReLU layers were initialized from a normal distribution using Kaiming initialization [29], whereas Xavier initialization was used in sigmoid layers [259]. Hyperparameter optimization was done on all networks using validation data (see Appendix B for details).

Training the model using node perturbation

Node perturbation is a technique that approximates gradient descent by randomly perturbing the activations of units in the network, and updating weights according to the change in the loss function [185, 28]. In the model trained using node perturbation, at each step, first the input is propagated through the network as usual, after which the global loss, $\mathcal{L}$, is recorded. Then, the same input is propagated again through the network but the activations of units in a single layer $l$ are randomly perturbed:

$$\mathbf{e}_l = f_l(W_l \ast \mathbf{e}_{l-1} + \boldsymbol{\xi}_l), \qquad (4.20)$$

where the elements of $\boldsymbol{\xi}_l$ are chosen from a normal distribution with mean 0 and standard deviation $\sigma$. The new loss, $\mathcal{L}_{NP}$, is recorded. The weights in layer $l$ are then updated using the following weight update rule:

$$\Delta W_l = -\eta_l\, (\mathcal{L}_{NP} - \mathcal{L})\, \boldsymbol{\xi}_l\, \mathbf{e}_{l-1}^T / \sigma^2. \qquad (4.21)$$

The layer to be perturbed, $l$, is changed with each mini-batch by iterating bottom-up through all of the layers in the network.
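Below is a sketch of node perturbation on a toy single-layer problem (a fully-connected sigmoid layer rather than the convolutional layers above): the loss change caused by a Gaussian perturbation of the pre-activations gives a stochastic estimate of the gradient. The data, layer size and learning rate are ours, and the minus sign makes the update descend the loss.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(0, 0.1, (3, 8))
x, target = rng.random(8), rng.random(3)
sigma, eta = 0.1, 0.2
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
loss = lambda xi: 0.5 * np.sum((sigmoid(W @ x + xi) - target) ** 2)

print(f"initial loss: {loss(0.0):.4f}")
for _ in range(2000):
    L = loss(0.0)                          # clean pass: record global loss
    xi = rng.normal(0.0, sigma, 3)         # perturb this layer's activations (Eq. 4.20)
    L_np = loss(xi)                        # perturbed pass
    # Update along the noise direction, scaled by the loss change (Eq. 4.21):
    W += -eta * (L_np - L) / sigma**2 * np.outer(xi, x)
print(f"final loss:   {loss(0.0):.4f}")
```

In expectation, $(\mathcal{L}_{NP} - \mathcal{L})\,\boldsymbol{\xi}_l / \sigma^2$ equals the gradient of the loss with respect to the pre-activations, so the mean update is an SGD step even though no error is backpropagated.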

4.7 Acknowledgments

We thank Adam Santoro and Leonard Maler for comments on this manuscript. We also thank Markus Hilscher and Maximiliano José Nigro for sharing data about SOM neurons. This work was supported by two NSERC Discovery Grants, 06872 (RN) and 04947 (BAR), a CIHR Project Grant (RN383647 - 418955), a Fellowship from the CIFAR Learning in Machines and Brains Program (BAR) and the Novartis Research Foundation (FZ).

4.8 Code availability

The code used in this article is available at https://github.com/apayeur/spikingburstprop and https://github.com/jordan-g/Burstprop.

Chapter 5

Project 3: Spike-based causal inference for weight alignment

Jordan Guerguiev1,2, Konrad P. Kording3, Blake A. Richards4,5,6,7,*

1 Department of Biological Sciences, University of Toronto Scarborough, Toronto, ON, Canada 2 Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada 3 Department of Bioengineering, University of Pennsylvania, PA, United States 4 Mila, Montreal, QC, Canada 5 Department of Neurology & Neurosurgery, McGill University, Montreal, QC, Canada 6 School of Computer Science, McGill University, Montreal, QC, Canada 7 Canadian Institute for Advanced Research, Toronto, ON, Canada * Corresponding author, email: [email protected]

This chapter was originally published as a manuscript at the International Conference on Learning Representations [45].

5.1 Abstract

In artificial neural networks trained with gradient descent, the weights used for processing stimuli are also used during backward passes to calculate gradients. For the real brain to approximate gradients, gradient information would have to be propagated separately, such that one set of synaptic weights is used for processing and another set is used for backward passes. This produces the so-called "weight transport problem" for biological models of learning, where the backward weights used to calculate gradients need to mirror the forward weights used to process stimuli. This weight transport problem has been considered so hard that popular proposals for biological learning assume that the backward weights are simply random, as in the feedback alignment algorithm. However, such random weights do not appear to work well for large networks. Here we show how the discontinuity introduced in a spiking system can lead to a solution to this problem. The resulting algorithm is a special case of an estimator used for causal inference in econometrics, regression discontinuity design. We show empirically that this algorithm rapidly makes the backward weights approximate the forward weights. As the backward weights become correct, this improves learning performance over feedback alignment on tasks such as Fashion-MNIST, SVHN, CIFAR-10 and VOC. Our results demonstrate that a simple learning rule in a spiking network can allow neurons to produce the right backward connections and thus solve the weight transport problem.

5.2 Author contributions

Jordan Guerguiev (JG) developed the computational model and performed all experiments. He also contributed to writing the manuscript. Konrad P. Kording served as an advisor for the development of the model, and assisted with writing the manuscript. Blake A. Richards is the thesis supervisor for JG, and provided guidance and support with the development of the model and selection of experiments to perform. He also contributed to the writing of the manuscript.

5.3 Introduction

Any learning system that makes small changes to its parameters will only improve if the changes are correlated with the gradient of the loss function. Given that people and animals show clear behavioral improvements on specific tasks [260], however the brain determines its synaptic updates, those updates must, on average, correlate with the gradients of some loss function related to the task [261]. As such, the brain may have some way of calculating at least an estimator of gradients. To date, the bulk of models for how the brain may estimate gradients are framed in terms of setting up a system where there are both bottom-up, feedforward and top-down, feedback connections. The feedback connections are used for propagating activity that can be used to estimate a gradient [185, 46, 42, 239, 110, 41, 114]. In all such models, the gradient estimator is less biased the more the feedback connections mirror the feedforward weights. For example, in the REINFORCE algorithm [185], and related algorithms like AGREL [239], learning is optimal when the feedforward and feedback connections are perfectly symmetric, such that for any two neurons i and j the synaptic weight from i to j equals the weight from j to i, e.g. $W_{ji} = W_{ij}$ (Figure 5.1). Some algorithms simply assume weight symmetry, such as Equilibrium Propagation [41]. The requirement for synaptic weight symmetry is sometimes referred to as the "weight transport problem", since it seems to mandate that the values of the feedforward synaptic weights are somehow transported into the feedback weights, which is not biologically realistic [38, 127]. Solving the weight transport problem is crucial to biologically realistic gradient estimation algorithms [46], and is thus an important topic of study. Several solutions to the weight transport problem have been proposed for biological models, including hard-wired sign symmetry [102], random fixed feedback weights [46], and learning to make the feedback weights symmetric [110, 114, 42, 106]. Learning to make the weights symmetric is promising because it is both more biologically feasible than hard-wired sign symmetry [102] and it leads to less bias in the gradient estimator (and thereby, better training results) than using fixed random feedback weights [100, 42]. However, of the current proposals for learning weight symmetry, some do not actually work well in practice [100] and others still rely on some biologically unrealistic assumptions, including scalar-valued activation functions (as opposed to all-or-none spikes) and separate error feedback pathways with one-to-one matching between processing neurons for the forward pass and error propagation neurons for the backward pass [42, 114]. Interestingly, learning weight symmetry is implicitly a causal inference problem: the feedback weights need to represent the causal influence of the upstream neuron on its downstream partners. As such, we may look to the causal inference literature to develop better, more biologically realistic algorithms for learning weight symmetry. In econometrics, which focuses on quasi-experiments, researchers have developed various means of estimating causality without the need to actually randomize and control the variables in question [262, 263]. Among such quasi-experimental methods, regression discontinuity design (RDD) is particularly promising. It uses the discontinuity introduced by a threshold to estimate causal effects.
For example, RDD can be used to estimate the causal impact of getting into a particular school (which is a discontinuous, all-or-none variable) on later earning power. RDD is also potentially promising for estimating causal impact in biological neural networks, because real neurons communicate with discontinuous, all-or-none spikes. Indeed, it has been shown that the RDD approach can produce unbiased estimators of causal effects in a system of spiking neurons [116]. Given that learning weight symmetry is fundamentally a causal estimation problem, we hypothesized that RDD could be used to solve the weight transport problem in biologically realistic, spiking neural networks.

Figure 5.1: Illustration of weight symmetry in a neural network with feedforward and feedback connections. Processing of inputs to outputs is mirrored by backward flow of gradient information. Gradient estimation is best when feedback synapses have symmetric weights to feedforward synapses (illustrated with colored circles).

Here, we present a learning rule for feedback synaptic weights that is a special case of the RDD algorithm previously developed for spiking neural networks [116]. Our algorithm takes advantage of a neuron's spiking discontinuity to infer the causal effect of its spiking on the activity of downstream neurons. Since this causal effect is proportional to the feedforward synaptic weight between the two neurons, by estimating it, feedback synapses can align their weights to be symmetric with the reciprocal feedforward weights, thereby overcoming the weight transport problem. We demonstrate that this leads to the reduction of a cost function which measures the weight symmetry (or the lack thereof), that it can lead to better weight symmetry in spiking neural networks than other algorithms for weight alignment [42], and that it leads to better learning in deep neural networks in comparison to the use of fixed feedback weights [46]. Altogether, these results demonstrate a novel algorithm for solving the weight transport problem that takes advantage of discontinuous spiking, and which could be used in future models of biologically plausible gradient estimation.

5.4 Related work

Previous work has shown that even when feedback weights in a neural network are initialized randomly and remain fixed throughout training, the feedforward weights learn to partially align themselves to the feedback weights, an algorithm known as feedback alignment [46]. While feedback alignment is successful at matching the learning performance of true gradient descent in relatively shallow networks, it does not scale well to deeper networks and performs poorly on difficult computer vision tasks [100]. The gap in learning performance between feedback alignment and gradient descent can be overcome if feedback weights are continually updated to match the sign of the reciprocal feedforward weights [102]. Furthermore, learning the feedback weights in order to make them more symmetric to the feedforward weights has been shown to improve learning over feedback alignment [42]. To understand the underlying dynamics of learning weight symmetry, [105] define the symmetric alignment cost function, $\mathcal{R}_{SA}$, as one possible cost function that, when minimized, leads to weight symmetry:

$$\mathcal{R}_{SA} := \| W - Y^T \|_F^2 = \| W \|_F^2 + \| Y \|_F^2 - 2\,\mathrm{tr}(WY), \qquad (5.1)$$

where $W$ are feedforward weights and $Y$ are feedback weights. The first two terms are simply weight regularization terms that can be minimized using techniques like weight decay. But the third term is the critical one for ensuring weight alignment. In this paper we present a biologically plausible method of minimizing the third term. This method is based on the work of [116], who demonstrated that neurons can estimate their causal effect on a global reward signal using the discontinuity introduced by spiking. This is accomplished using RDD, wherein a piecewise linear model is fit around a discontinuity, and the difference in the regression intercepts indicates the causal impact of the discontinuous variable. In [116], neurons learn a piecewise linear model of a reward signal as a function of their input drive, and estimate the causal effect of spiking by looking at the discontinuity at the spike threshold. Here, we modify this technique to perform causal inference on the effect of spiking on downstream neurons, rather than a reward signal. We leverage this to develop a learning rule for feedback weights that induces weight symmetry and improves training.

5.5 Our contributions

The primary contributions of this work are as follows:

• We demonstrate that spiking neurons can accurately estimate the causal effect of their spiking on downstream neurons by using a piece-wise linear model of the feedback as a function of the input drive to the neuron.

• We present a learning rule for feedback weights that uses the causal effect estimator to encourage weight symmetry. We show that updating the feedback weights with this algorithm minimizes the symmetric alignment cost function, $\mathcal{R}_{SA}$.

• We demonstrate that this weight-symmetry learning rule improves training and test accuracy over feedback alignment, approaching gradient-descent-level performance on Fashion-MNIST, SVHN, CIFAR-10 and VOC in deeper networks.

5.6 Methods

5.6.1 General approach

In this work, we utilize a spiking neural network model for aligning feedforward and feedback weights. However, due to the intense computational demands of spiking neural networks, we only use spikes for the RDD algorithm. We then use the feedback weights learned by the RDD algorithm for training a non-spiking convolutional neural network. We do this because the goal of our work here is to develop an algorithm for aligning feedback weights in spiking networks, not for training feedforward weights in spiking networks on other tasks. Hence, in the interest of computational expediency, we only used spiking neurons when learning to align the weights. Additional details on this procedure are given below.

5.6.2 RDD feedback training phase

At the start of every training epoch of a convolutional neural network, we use an RDD feedback weight training phase, during which all fully-connected sets of feedback weights in the network are updated. To perform these updates, we simulate a separate network of leaky integrate-and-fire (LIF) neurons. LIF neurons incorporate key elements of real neurons such as voltages, spiking thresholds and refractory periods. Each epoch, we begin by training the feedback weights in the LIF network. These weights are then transferred to the convolutional network, which is used for training the feedforward weights. The new feedforward weights are then transferred to the LIF net, and another feedback training phase with the LIF net starts the next epoch (Figure 5.2A). During the feedback training phase, the LIF network undergoes a training phase lasting 90 s of simulated time (30 s per set of feedback weights) (Figure 5.2B). We find that the spiking network used for RDD feedback training and the convolutional neural network are very closely matched in the activity of the units (Figure C.1), which gives us confidence that this approach of using a separate non-spiking network for training the feedforward weights is legitimate. During the feedback training phase, a small subset of neurons in the first layer receive driving input that causes them to spike, while other neurons in this layer receive no input (see Appendix C.2). The subset of neurons that receive driving input is randomly selected every 100 ms of simulated time. This continues for 30 s in simulated time, after which the same process occurs for the subsequent hidden layers in the network. This protocol enforces sparse, de-correlated firing patterns that improve the causal inference procedure of RDD.

5.6.3 LIF dynamics

During the RDD feedback training phase, each unit in the network is simulated as a leaky integrate-and-fire neuron. Spiking inputs from the previous layer arrive at feedforward synapses, where they are convolved with a temporal exponential kernel to simulate post-synaptic spike responses $\mathbf{p} = [p_1, p_2, ..., p_m]$

(see Appendix C.1). The neurons can also receive driving input $\tilde{p}_i$ instead of synaptic inputs. The total feedforward input to neuron $i$ is thus defined as:

$$I_i := \begin{cases} \omega \tilde{p}_i & \text{if } \tilde{p}_i > 0 \\ \sum_{j=1}^{m} W_{ij} p_j & \text{otherwise,} \end{cases} \qquad (5.2)$$

where Wij is the feedforward weight to neuron i from neuron j in the previous layer, and ω is a hyperparameter. The voltage of the neuron, vi, evolves as:

$$\frac{dv_i}{dt} = -g_L v_i + g_D (I_i - v_i) \qquad (5.3)$$

where gL and gD are leak and dendritic conductance constants, respectively. The input drive to the neuron, ui, is similarly modeled:

$$\frac{du_i}{dt} = -g_L u_i + g_D (I_i - u_i) \qquad (5.4)$$

If the voltage vi passes a spiking threshold θ, the neuron spikes and the voltage is reset to a value vreset = −1 (Figure 5.2C). Note that the input drive does not reset. This helps us to perform regressions both above and below the spike threshold. In addition to feedforward inputs, spiking inputs from the downstream layer arrive at feedback synapses, where they create post-synaptic spike responses q = [q1, q2, ..., qn]. These responses are used in the causal effect estimation (see below).

5.6.4 RDD algorithm

Whenever the voltage approaches the threshold $\theta$ (i.e. $|v_i - \theta| < \alpha$, where $\alpha$ is a constant), an RDD window is initiated, lasting $T = 30$ ms in simulated time (Figure 5.2C). At the end of this time window, at each feedback synapse, the maximum input drive during the RDD window, $u_i^{\max}$, and the average change in feedback from downstream neuron $k$ during the RDD window, $\Delta q_k^{\text{avg}}$, are recorded. $\Delta q_k^{\text{avg}}$ is defined as the difference between the average feedback received during the RDD window, $q_k^{\text{avg}}$, and the feedback at the start of the RDD window, $q_k^{\text{pre}}$:

$$\Delta q_k^{\text{avg}} := q_k^{\text{avg}} - q_k^{\text{pre}} \qquad (5.5)$$

Importantly, $u_i^{\max}$ provides a measure of how strongly neuron $i$ was driven by its inputs (and whether or not it passed the spiking threshold $\theta$), while $\Delta q_k^{\text{avg}}$ is a measure of how the input received as feedback from neuron $k$ changed after neuron $i$ was driven close to its spiking threshold. These two values are then used to fit a piecewise linear model of $\Delta q_k^{\text{avg}}$ as a function of $u_i^{\max}$ (Figure 5.2D). This piecewise linear model is defined as:

$$f_{ik}(x) := \begin{cases} c_{ik}^1 x + c_{ik}^2 & \text{if } x < \theta \\ c_{ik}^3 x + c_{ik}^4 & \text{if } x \geq \theta \end{cases} \qquad (5.6)$$

The parameters $c_{ik}^1$, $c_{ik}^2$, $c_{ik}^3$ and $c_{ik}^4$ are updated to perform linear regression using gradient descent:

$$\mathcal{L} = \frac{1}{2} \left\| f_{ik}(u_i^{\max}) - \Delta q_k^{\text{avg}} \right\|^2 \qquad (5.7)$$

$$\Delta c_{ik}^l \propto -\frac{\partial \mathcal{L}}{\partial c_{ik}^l} \quad \text{for } l \in \{1, 2, 3, 4\} \qquad (5.8)$$

An estimate of the causal effect of neuron $i$ spiking on the activity of neuron $k$, $\beta_{ik}$, is then defined as the difference in the two sides of the piecewise linear function at the spiking threshold:

$$\beta_{ik} := \lim_{x \to \theta^+} f_{ik}(x) - \lim_{x \to \theta^-} f_{ik}(x) \qquad (5.9)$$

Finally, the weight at the feedback synapse, $Y_{ik}$, is updated to be a scaled version of $\beta_{ik}$:

$$Y_{ik} = \frac{\gamma}{\sigma_\beta^2}\, \beta_{ik}, \qquad (5.10)$$

where $\gamma$ is a hyperparameter and $\sigma_\beta$ is the standard deviation of $\beta$ values for all feedback synapses in the layer. This ensures that the scale of the full set of feedback weights between two layers in the network remains stable during training.
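Below is a toy sketch of the estimator in Eqs. 5.6-5.10: we generate synthetic $(u_i^{\max}, \Delta q_k^{\text{avg}})$ pairs with a known jump at the threshold, fit the two line segments by stochastic gradient descent as in Eqs. 5.7-5.8, and read the causal-effect estimate off the discontinuity. The data-generating process (a smooth trend plus a true jump and noise) is ours.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, true_effect = 1.0, 0.5
u_max = rng.uniform(0.5, 1.5, 2000)                # max input drive per RDD window
dq = (0.3 * u_max                                  # smooth trend (confound)
      + true_effect * (u_max >= theta)             # true causal jump at the threshold
      + 0.05 * rng.normal(size=u_max.size))        # noise

c = np.zeros(4)                                    # c1, c2 below theta; c3, c4 above
lr = 0.01
for _ in range(20000):                             # SGD on the loss of Eq. 5.7
    i = rng.integers(u_max.size)
    xi, yi = u_max[i], dq[i]
    s, o = (0, 1) if xi < theta else (2, 3)        # which line segment to update
    err = c[s] * xi + c[o] - yi
    c[s] -= lr * err * xi                          # Eq. 5.8: gradient step on the slope
    c[o] -= lr * err                               # ... and on the intercept

beta = (c[2] * theta + c[3]) - (c[0] * theta + c[1])   # jump at threshold (Eq. 5.9)
print(f"estimated causal effect: {beta:.3f} (true value {true_effect})")
```

In the full algorithm, each feedback weight $Y_{ik}$ would then be set to the scaled estimate as in Eq. 5.10.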

5.7 Results

5.7.1 Alignment of feedback and feedforward weights

To measure how well the causal effect estimate at each feedback synapse, $\beta_{ik}$, and thus the feedback weight $Y_{ik}$, reflects the reciprocal feedforward weight $W_{ki}$, we can measure the percentage of feedback weights that have the same sign as the reciprocal feedforward weights (Figure 5.3A). When training on CIFAR-10 with no RDD feedback training phase (i.e. feedback weights remain fixed throughout training), the feedback alignment effect somewhat increases the sign alignment during training, but it is ineffective at aligning the signs of weights in earlier layers in the network. Compared to feedback alignment, the addition of an RDD feedback training phase greatly increases the sign alignment between feedback and feedforward weights for all layers in the network, especially at earlier layers. In addition, the RDD algorithm increases sign alignment throughout the hierarchy more than the current state-of-the-art algorithm for weight alignment introduced recently by Akrout et al. [42] (Figure 5.3A). Furthermore, RDD feedback training changes feedback weights to not only match the sign but also the magnitude of the reciprocal feedforward weights (Figure 5.3B), which makes it better for weight alignment than hard-wired sign symmetry [102].

5.7.2 Descending the symmetric alignment cost function

The symmetric alignment cost function [105] (Equation 5.1) can be broken down as:

$$\mathcal{R}_{SA} = \mathcal{R}_{\text{decay}} + \mathcal{R}_{\text{self}}, \qquad (5.11)$$

where we define $\mathcal{R}_{\text{decay}}$ and $\mathcal{R}_{\text{self}}$ as:

$$\mathcal{R}_{\text{decay}} := \| W \|_F^2 + \| Y \|_F^2 \qquad (5.12)$$

$$\mathcal{R}_{\text{self}} := -2\,\mathrm{tr}(WY) \qquad (5.13)$$

$\mathcal{R}_{\text{decay}}$ is simply a weight regularization term that can be minimized using techniques like weight decay. $\mathcal{R}_{\text{self}}$, in contrast, measures how well aligned in direction the two matrices are. Our learning rule for feedback weights minimizes the $\mathcal{R}_{\text{self}}$ term for weights throughout the network (Figure 5.4). By comparison, feedback alignment decreases $\mathcal{R}_{\text{self}}$ to a smaller extent, and its ability to do so diminishes at earlier layers in the network. This helps to explain why our algorithm induces weight alignment, and can improve training performance (see below).
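A quick numerical check of this decomposition for random matrices (our toy shapes):

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.normal(size=(4, 7))
Y = rng.normal(size=(7, 4))

R_SA = np.linalg.norm(W - Y.T, "fro") ** 2       # Eq. 5.1
R_decay = np.linalg.norm(W, "fro") ** 2 + np.linalg.norm(Y, "fro") ** 2   # Eq. 5.12
R_self = -2.0 * np.trace(W @ Y)                  # Eq. 5.13
print(f"R_SA = {R_SA:.4f}, R_decay + R_self = {R_decay + R_self:.4f}")   # equal
```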

5.7.3 Performance on Fashion-MNIST, SVHN, CIFAR-10 and VOC

We trained the same network architecture (see Appendix C.3) on the Fashion-MNIST, SVHN, CIFAR-10 and VOC datasets using standard autograd techniques (backprop), feedback alignment and our RDD feedback training phase. RDD feedback training substantially improved the network’s performance over feedback alignment, and led to backprop-level accuracy on the train and test sets (Figure 5.5).

5.8 Discussion

In order to understand how the brain learns complex tasks that require coordinated plasticity across many layers of synaptic connections, it is important to consider the weight transport problem. Here, we presented an algorithm for updating feedback weights in a network of spiking neurons that takes advantage of the spiking discontinuity to estimate the causal effect between two neurons (Figure 5.2). We showed that this algorithm enforces weight alignment (Figure 5.3), and identified a loss function, $\mathcal{R}_{\text{self}}$, that is minimized by our algorithm (Figure 5.4). Finally, we demonstrated that our algorithm allows deep neural networks to achieve better learning performance than feedback alignment on Fashion-MNIST and CIFAR-10 (Figure 5.5). These results demonstrate the potential power of RDD as a means for solving the weight transport problem in biologically plausible deep learning models. One aspect of our algorithm that is still biologically implausible is that it does not adhere to Dale's principle, which states that a neuron performs the same action on all of its target cells [264]. This means that a neuron's outgoing connections cannot include both positive and negative weights. However, even under this constraint, a neuron can have an excitatory effect on one downstream target and an inhibitory effect on another, by activating intermediary inhibitory interneurons. Because our algorithm provides a causal estimate of one neuron's impact on another, theoretically, it could capture such polysynaptic effects. Therefore, this algorithm is in theory compatible with Dale's principle. Future work should test the effects of this algorithm when implemented in a network of neurons that are explicitly excitatory or inhibitory.

Figure 5.2: Outline of the feedback weight learning algorithm. A. Layers of the convolutional network trained on CIFAR-10 and the corresponding network of LIF neurons that undergoes RDD feedback training. Fully-connected feedforward weights (blue) and feedback weights (red) are shared between the two networks. Every training epoch consists of an RDD feedback training phase where feedback weights in the LIF net are updated (bold red arrows) and transferred to the convolutional net, and a regular training phase where feedforward weights in the convolutional net are updated (bold blue arrows) and transferred back to the LIF net. B. RDD feedback training protocol. Every 30 s, a different layer in the LIF network receives driving input and updates its feedback weights (red) using the RDD algorithm. C. Top: Sample voltage ($v_i$, solid line) and input drive ($u_i$, dashed line) traces. Whenever $v_i$ approaches the spiking threshold, an RDD window lasting $T$ ms is triggered. $u_i^{\max}$ is the maximum input drive during this window of time. Bottom: Feedback received at a synapse, $q_k$. $q_k^{\text{pre}}$ is the feedback signal at the start of an RDD window, while $q_k^{\text{avg}}$ is the average of the feedback signal during the time window. D. Samples of $\Delta q_k^{\text{avg}}$ vs. $u_i^{\max}$ are used to update a piecewise linear function of $u_i^{\max}$, and the causal effect $\beta_{ik}$ is defined as the difference of the left and right limits of the function at the spiking threshold.

Figure 5.3: Feedback weights become aligned with feedforward weights during training. A. Evolution of sign alignment (the percent of feedforward and feedback weights that have the same sign) for each fully-connected layer in the network when trained on CIFAR-10 using RDD feedback training (blue), using the algorithm proposed by [42] (purple), and using feedback alignment (red). B. Feedforward vs. feedback weights for each fully-connected layer at the end of training, with RDD feedback training (blue), the Akrout algorithm (purple), and feedback alignment (red).

Figure 5.4: Evolution of $\mathcal{R}_{\text{self}}$ for each fully-connected layer in the network when trained on CIFAR-10 using RDD feedback training (solid lines) and using feedback alignment (dashed lines). RDD feedback training dramatically decreases this loss compared to feedback alignment, especially in earlier layers.

Figure 5.5: Comparison of Fashion-MNIST, SVHN, CIFAR-10 and VOC train error (top row) and test error (bottom row). RDD feedback training substantially improves test error performance over feedback alignment in these learning tasks.

Chapter 6

Discussion

The work presented in this thesis consists of three projects that as a whole present novel biologically plausible mechanisms of credit assignment in the brain. Below are brief summaries of the results of each project, followed by a discussion of some of the challenges that were encountered and the limitations of the models presented in each of the three projects, as well as a brief outline of the experimental predictions made by each model. Finally, some avenues for future research related to these projects, and the study of biological credit assignment as a whole, are presented.

The first project (Chapter 3) introduced a novel computational model for biologically plausible credit assignment in multi-layer networks of pyramidal neurons with segregated apical dendrites. This project involved simulating a network of three-compartment neurons with voltage dynamics, conductances and Poisson spiking. Random, fixed feedback weights between layers were used, leveraging the feedback alignment effect to overcome the issue of weight transport in a network with hidden layers. A plasticity rule at feedforward synapses was presented that uses only locally-available information to approximate gradient descent. Using this learning rule, a multi-layer network was trained on classifying handwritten digits from the MNIST database, and achieved good performance compared to backpropagation of error. The angle between the weight updates prescribed by the learning rule and by backprop was shown to decrease below 90° during training, demonstrating that the learning rule was approximating gradient descent.

The second project (Chapter 4) demonstrated that ensembles of pyramidal neurons can communicate rich error signals approximating gradient descent through a network hierarchy by multiplexing sensory feedforward information and feedback signals in their bursting activity. Notably, the communication of error signals does not interfere with the flow of feedforward sensory information at any point, unlike in previous models of biologically plausible gradient descent. A simulated network of leaky integrate-and-fire (LIF) pyramidal neurons and inhibitory interneurons learned to solve the XOR problem, a problem that requires gradient descent through multiple layers of connections. Furthermore, a rate-based model trained on CIFAR-10 and ImageNet was able to perform well on these challenging tasks, outperforming previous biologically-plausible models of gradient descent. In order to accomplish this, a simple biologically plausible learning rule for feedback weights that leads to weight symmetry was presented. Overall, this work presents a novel theory for how the unique properties of pyramidal neurons, namely, segregated apical dendrites and bursting, can allow them to perform gradient descent in a way that allows concurrent communication of feedforward and feedback signals.

The third project (Chapter 5) proposed a plasticity rule for a synapse that changes its synaptic strength to reflect the causal effect of the neuron’s spiking on the activity of the pre-synaptic neuron. This learning rule, based on regression discontinuity design, takes advantage of the binary spiking nature of a neuron to create an unbiased estimator of the causal effect of its spiking on the input it receives at synapses. We demonstrate that, in a multi-layer neural network with feedback weights, this learning rule causes feedback weights to become symmetric with the feedforward weights in the network, providing a solution to the weight transport problem. This is shown to enable backprop-level performance on a variety of standard machine learning tasks, including CIFAR-10 and Fashion-MNIST.

6.1 Challenges and limitations

6.1.1 Project 1: Towards deep learning with segregated dendrites

While the first project (Chapter 3) successfully demonstrated that segregated dendrites can be useful to separate error signals from feedforward information, there are a number of limitations and remaining biological implausibilities in the final model. First, Poisson spiking neurons were used, which provided a theoretical guarantee about the estimation of a neuron’s firing rate by downstream neurons. While useful from a theoretical point of view, Poisson spiking does not reflect the complex spiking activity of real neurons. The following two projects used leaky integrate-and-fire neurons in order to more faithfully simulate neuronal spiking.

A second limitation of the model was that Ca2+ potentials and bursting, used to communicate feedback signals at the apical dendritic compartment to the basal dendrites at the end of the forward and target phases, were modeled as instantaneous processes. The voltage dynamics of these plateau potentials and the bursting that occurs as a consequence of the interplay of Ca2+ spikes and back-propagating action potentials were not simulated. Furthermore, while feedforward and feedback information can flow through the network unperturbed by feedback signals during the transmit state of the network, the use of plateau potentials to communicate feedback information to the soma implies that sensory feedforward information, encoded in the firing rate of a neuron, would be corrupted during the plateau state. This limitation was addressed in the second project, which uses burst multiplexing in ensembles of neurons to allow concurrent feedforward and feedback information flow through a network without any interference.

Another limitation of the model is that the feedback pathway consists of direct feedback connections from the output layer to all hidden layers in the network. Layer-to-layer feedback connections are not compatible with the presented learning rule. In Project 2, this issue is addressed by developing a weight update rule that is compatible with reciprocal feedback connections between layers.

The use of alternating transmit and plateau states of the network, with coordinated plateau potentials occurring across the population of neurons, is another biologically implausible aspect of the model. Having plateau potentials occur at the same time across the population simplified the computational model, but this coordination is not required for such a model to work. The use of two plateau potentials in the two phases of training, however, is problematic, because it does not reflect the complicated dynamics underlying the generation of plateau potentials, which are dependent on pre-synaptic and post-synaptic activity.

The main challenge faced during this project was the large amount of computational resources required to simulate and train multi-layer spiking networks. This was primarily due to the near continuous-time simulation of voltage dynamics and spiking, which caused training to be very slow compared to traditional rate-based models used in machine learning. The choice to simulate neuronal dynamics rather than use a rate-based model was made in order to enhance the biological realism of the model. However, this meant that training deeper networks and/or training on more complicated tasks than MNIST, such as CIFAR-10 and ImageNet, proved infeasible. In the later projects, this problem was addressed by introducing an additional simplified rate-based model when training on more complex learning tasks requiring larger networks.

6.1.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits

The second project (Chapter 4) is among the first to present a biologically plausible model for gradient descent that incorporates a number of biological aspects including inhibitory neurons, spiking, multiple compartments and on-line learning. Nevertheless, there remain several limitations relating to biological plausibility. One was the use of thousands of neurons per ensemble in some experiments, which can lead to network sizes that are not realistic (however, the XOR experiments used 500 neurons per ensemble). The large number of neurons in an ensemble was used in order to obtain a good estimate of the average burst probability in a small timestep. In theory, the size of an ensemble can decrease while the length of time over which burst probabilities are averaged is increased. Future work could investigate the trade-off between ensemble size and integration time on obtaining reliable estimates of event and burst rates.

A second limitation is the use of a global neuromodulatory signal M(t) that gates plasticity. If such a signal was not used and plasticity was ongoing at all times, any weight changes caused by changes in burst probabilities when a target signal appears would be reversed once the target signal disappears. Weights would also be updated any time the input signal changes. This is an issue when doing supervised learning, where weight changes should be driven by the teaching signal. However, recent studies indicate that neuromodulators can act as a gating factor for plasticity in the cortex [78, 79], lending experimental support for the use of M(t) in plasticity rules.

As in Project 1, the large computational resource requirements of simulating networks of spiking neurons were a key challenge during the second project. This challenge was compounded due to the need to simulate ensembles of neurons, each of which represents a computational unit. This meant that training a network on the XOR problem required simulating thousands of spiking neurons. Due to this computational complexity, a separate rate-based model was used to train on more complicated machine learning tasks like MNIST, CIFAR-10 and ImageNet. A mathematical analysis was done by Alexandre Payeur to demonstrate that the rate-based model is a good approximation to the spike-based model under certain assumptions. Ideally, the same spiking model would be used for all experiments; given current computational limitations, this was not possible.

Training a deep network on ImageNet using the burstprop learning rule proved to be another significant challenge. ImageNet is one of the largest machine learning datasets, comprising over 1 million training and testing images with 1000 classes. Learning to classify these images requires the use of high-dimensional networks with many layers. Recently, architectures like residual neural networks (ResNet) have emerged as much more efficient ways to train on datasets like ImageNet than traditional fully-connected networks [89]. However, adapting the burstprop learning rule to such an architecture is not trivial; given the time and resource constraints of the project, a fully-connected network was used instead. Training was done on GPU clusters to speed up training and hyperparameter optimization. However, the large size of the dataset, combined with the additional hyperparameters introduced in burstprop, such as feedback and recurrent weight initialization, recurrent learning rates, and layer-wise feedforward learning rates, meant that finding optimal hyperparameters was a highly time-consuming process. This likely affected the final results that showed impaired performance compared to backprop. Given additional time and resources, a much more exhaustive hyperparameter search might reveal a more optimal set of hyperparameters.

Finally, adapting the burstprop learning rule to be compatible with activation functions commonly used in machine learning models, such as ReLU and softmax, also proved challenging. The difficulty lay in modifying the feedback inputs u to units in the rate-based model with additional factors such that the burst rates reflected the derivatives of the activation functions. This was further complicated by the application of a nonlinear function (a sigmoid) to u when computing the burst probabilities, and the multiplication by the event rate e to compute the burst rates. In the final formulation, the learning rules use approximations to these derivatives, but the approximations are skewed by the sigmoid function. In the case of softmax, the approximation fails when the event rate of an ensemble is very small; in practice, this was not an issue, and event rates can be guaranteed to stay above a certain value by introducing a bias term. Furthermore, the use of recurrent inputs to keep u in the linear regime of the sigmoid function avoided issues with approximating the derivative. Nevertheless, the lack of a guarantee that the burstprop model is compatible with any nonlinear activation function remains a limitation of the model.
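The role of the sigmoid can be seen in a few lines. In the rate-based model, the burst probability is a sigmoid of the feedback input u and the burst rate multiplies in the event rate e; near u = 0 the sigmoid is approximately linear, which is the regime the recurrent inputs are meant to maintain. The values below are illustrative only:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    e = 0.6                    # event rate of the ensemble (illustrative)
    u = np.linspace(-4, 4, 9)  # feedback input
    bp = sigmoid(u)            # burst probability
    br = e * bp                # burst rate

    # Near u = 0, sigmoid(u) ~ 0.5 + u/4, so the burst rate tracks u almost
    # linearly; far from 0 the signal is squashed, skewing the derivative
    # approximation described above.
    linear = e * (0.5 + u / 4.0)
    print(np.round(br - linear, 3))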

6.1.3 Project 3: Spike-based causal inference for weight alignment

The third project (Chapter 5) successfully demonstrates that the spiking of a neuron can be used by feedback synapses to estimate causal effects and learn weight symmetry. However, it also has a number of limitations. First, the learning rule requires a synapse to keep track of a large number of variables, including all of the coefficients of the piecewise-linear model. While certainly possible in real neurons, it is unclear what the cellular substrates of these variables would be, and a more parsimonious learning rule would require fewer variables. Recent work has proposed alternative learning rules for achieving symmetry in feedback weights that require fewer variables [265].
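For reference, the core of the regression discontinuity estimate can be sketched in a few lines of Python. A separate line is fit to a downstream feedback signal on either side of the spiking threshold, and the jump between the two fits at the threshold estimates the causal effect of the spike. The data and the true effect size (0.5) are fabricated for illustration; the piecewise-linear coefficients are the variables the synapse must track:

    import numpy as np

    rng = np.random.default_rng(1)

    threshold = 1.0
    x = rng.uniform(0.5, 1.5, 1000)   # maximal drive within a window
    spiked = x >= threshold           # spiking occurs above threshold
    y = 0.3 * x + 0.5 * spiked + 0.05 * rng.standard_normal(x.size)

    # Fit the piecewise-linear model: one line per side of the threshold.
    below = np.polyfit(x[~spiked], y[~spiked], 1)
    above = np.polyfit(x[spiked], y[spiked], 1)

    # Causal effect = discontinuity between the two fits at the threshold.
    effect = np.polyval(above, threshold) - np.polyval(below, threshold)
    print(f"estimated causal effect: {effect:.3f}")  # close to 0.5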

A second limitation of the model is the absence of diversity in neuron types and connection motifs. Because both the feedforward and feedback weights in the model can be either positive or negative, the neurons in the model can have an excitatory effect on some post-synaptic neurons and an inhibitory effect on others. This is compatible with Dale's principle if an excitatory neuron activates inhibitory interneurons that in turn inhibit downstream neurons, but in this case inhibitory interneurons were not explicitly modeled. Furthermore, all neurons in a layer of the network were fully connected to neurons in the downstream layer, which does not capture the complex circuitry present in the brain, including sparse connectivity patterns, recurrent connections and long-range connections. In theory, the learning rule presented in this work is agnostic to the connectivity patterns in the network, and should be capable of obtaining an unbiased estimate of causal effects in a more complex network structure, but this remains to be explored in future work.
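The Dale's principle point can be made explicit with a small sketch: any signed weight matrix can be split into a non-negative direct excitatory pathway and a non-negative pathway routed through an inhibitory relay, which is the circuit the model implicitly assumes (the matrix values are arbitrary):

    import numpy as np

    W = np.array([[0.8, -0.3],
                  [-0.5, 0.2]])   # signed weights used by the model
    W_exc = np.clip(W, 0, None)   # direct excitatory connections (>= 0)
    W_inh = np.clip(-W, 0, None)  # weights carried by inhibitory relays (>= 0)
    assert np.allclose(W, W_exc - W_inh)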

As in the first two projects, the simulation of spiking neurons in the final project led to computational challenges. To deal with this, for each experiment an additional rate-based network was used to train feedforward weights, while the spiking network was only used to train feedback weights using the RDD-based learning rule. This required training the feedforward and feedback weights of a network in distinct phases separated in time. While the feedback weight learning rule is theoretically compatible with simultaneous plasticity of feedforward and feedback weights, this could not be demonstrated due to the use of two separate networks. Training a fully spiking neural network on a complex machine learning task such as CIFAR-10 using a feedback weight plasticity rule is a potential avenue for future research.

6.2 Experimental predictions

Each of the three projects in this thesis presents a unique set of experimental predictions. Below, these predictions are listed, separated by project.

6.2.1 Project 1: Towards deep learning with segregated dendrites

The model presented in Project 1 (Chapter 3) makes two primary experimental predictions. First, it predicts that feedforward sensory information is communicated at basal synapses, while top-down feedback targets distal apical dendrites. This prediction is consistent with previous experimental work [64, 63]. Second, it predicts that apical dendritic activity determines plasticity at basal synapses. More specifically, the model predicts that the relative timing between inputs at basal and apical dendrites determines the sign and magnitude of plasticity at basal synapses. Some experimental work has provided evidence for this [266, 137]. Future experiments could further investigate the effect of different timings of apical inputs relative to basal inputs on plasticity.

6.2.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits

The main prediction of the model presented in Project 2 (Chapter 4) is that ensembles of pyramidal neurons encode apical inputs in their burst probability and basal inputs in their event rate. The model also predicts that the running average of the burst probability of a neuron determines the threshold for LTP. As well, it predicts STD at basal synapses carrying bottom-up information and STF at apical synapses receiving top-down feedback signals. Another important prediction is that the burst probability carries information about learning errors, and therefore the variance in burst probability should decrease as a task is learned. Finally, the model predicts that inhibition at apical dendrites plays a homeostatic role in regulating burst probabilities and improves learning; therefore, disruption of inhibitory inputs at apical dendrites should alter burst probabilities and impair learning of a task.
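The threshold prediction can be stated compactly: the LTP/LTD boundary tracks a slow running average of the burst probability. The form below is a schematic of the prediction with an arbitrary timescale, not the full plasticity rule of Chapter 4:

    def plasticity_sign(bp, bp_avg, tau=200.0):
        # A slow homeostatic average of the burst probability sets the
        # threshold; bursting above it predicts LTP, below it predicts LTD.
        bp_avg += (bp - bp_avg) / tau
        return bp_avg, ("LTP" if bp > bp_avg else "LTD")

Under this reading, the decreasing variance of the burst probability during learning follows directly: as errors shrink, the burst probability hovers near its running average.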

6.2.3 Project 3: Spike-based causal inference for weight alignment

The third project (Chapter 5) predicts that synapses receiving top-down feedback information undergo plasticity such that the synaptic strength reflects the causal effect of the post-synaptic neuron on the pre-synaptic neuron's activity. Testing this prediction experimentally should be possible with modern tools: isolate a pyramidal neuron (neuron A) that receives top-down apical input from another neuron (neuron B), measure the effect of neuron A's spiking on the activity of neuron B, and compare that to the strength of the synapse between neuron B and neuron A. If done for a large number of synaptic connections, such an experiment could provide important insight into whether weight symmetry between bottom-up and top-down connections occurs in cortical circuits.

6.3 Future directions

While the work presented in this thesis, and research on biologically plausible gradient descent as a whole, has made significant strides toward answering the question of credit assignment in hierarchical circuits, many questions remain that future research can address, some of which are discussed below.

6.3.1 Learning modalities and network architectures

The work presented in this thesis has focused solely on supervised learning tasks, where a teaching signal is presented to the network and generates a rich error signal that drives plasticity. In reality, it is widely believed that cortical regions such as visual and auditory cortex must mostly perform unsupervised learning, learning useful representations of sensory inputs without external supervisory signals. A common theory is that this is accomplished through predictive coding, where the brain generates predictions of sensory input and computes internal prediction errors [267]. An interesting area of future research would be to investigate how biologically plausible learning rules that approximate gradient descent, such as those presented here, can be adapted to a predictive coding paradigm; a minimal sketch of the paradigm is given after this paragraph. Recent models by Whittington et al. [39] and João et al. [268] have begun to investigate biological approximations to gradient descent in the context of predictive coding and unsupervised learning. In addition, reward-based learning, whose machine learning analog is reinforcement learning, has been well-characterized in the brain. Translating biologically plausible gradient descent learning rules to deep reinforcement learning tasks is another potential direction for future research. Furthermore, the models presented here, as well as most models of biologically plausible gradient descent, are trained on image classification tasks, since these are standard benchmarks in the field of machine learning. In machine learning, backpropagation of error has been applied to a wide variety of tasks that the brain is capable of, such as natural language processing, object detection and motor control [97, 93, 91]. These tasks are solved in machine learning models using different network architectures incorporating recurrent and/or skip connections, specialized computational units such as LSTMs and memory units, and eligibility traces. Future work should focus on broadening the scope of learning modalities and network architectures to which biologically plausible learning rules are applied.
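A minimal predictive-coding sketch makes the connection concrete: a latent estimate z predicts the input x through generative weights W, and the locally computed prediction error drives both inference and a Hebbian-like weight update. All values here are illustrative:

    import numpy as np

    rng = np.random.default_rng(2)

    W = rng.standard_normal((10, 4)) * 0.1  # generative weights
    x = rng.standard_normal(10)             # sensory input
    z = np.zeros(4)                         # latent estimate

    for _ in range(50):
        err = x - W @ z           # internally generated prediction error
        z += 0.1 * (W.T @ err)    # inference: settle the latent estimate

    W += 0.01 * np.outer(err, z)  # learning: one local, error-times-activity step
    print(np.linalg.norm(x - W @ z))

Adapting rules like burstprop to this setting would amount to replacing the externally supplied teaching signal with such internally generated errors.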

6.3.2 Teaching and neuromodulatory signals

The supervised learning models explored in this thesis make use of externally generated teaching signals to produce errors that drive plasticity throughout the networks. Cellular mechanisms for reinforcement learning have been extensively studied, and dopamine neurons have been implicated in the generation of reward prediction errors that drive this form of learning [78]. However, whether, and how, rich error signals of the form used in supervised learning are generated in the brain remains an outstanding question in neuroscience. In addition, in all three models presented here, weight updates occur only once a teaching signal is present. This implies the existence of a global neuromodulatory signal that gates plasticity throughout a network. While neuromodulatory signals have been shown to modulate LTP and LTD [79, 78], including changing the sign of plasticity, the existence of neuromodulators that act as gating factors for plasticity across an entire circuit remains unknown and is a fruitful avenue for future experimental work. On the computational side, the development of models that are able to learn without such gating signals warrants further investigation.

6.3.3 Neuron types, connectivity motifs and cortical layers

There exists a large variety of neuron types and connectivity patterns in the brain. The first project presented here focuses solely on the ability of cortical pyramidal neurons with segregated apical dendrites to perform gradient descent, since pyramidal neurons are believed to be the primary neurons involved in sensory processing and learning in the cortex. However, experimental work has demonstrated that inhibitory neurons play an important role in modulating cortical activity [51, 14]. The spiking model presented in the second project addresses this by incorporating PV+ and SOM+ interneurons that provide feedforward inhibition, as well as disinhibition via VIP+ interneurons that controls the slope of the burst probability of an ensemble. Interneurons are also incorporated in some other models of biologically plausible gradient descent, such as the model by João et al. [268], where SOM+ interneurons learn to cancel top-down feedback at apical dendrites. In our rate-based model, PV+ and SOM+ interneurons are not explicitly modeled, for the sake of simplicity. The learning rule presented in the third project is agnostic to neuron type; however, the co-existence of positive and negative weights on post-synaptic targets implies inhibitory interneurons that are not explicitly modeled. While PV+, VIP+ and SOM+ interneurons make up the majority of inhibitory neurons in the cortex [50], other interneurons, such as neurogliaform cells, are not represented by these models, and could be incorporated in future work.

Projects 1 and 2 utilize pyramidal neurons with electrotonically segregated apical dendrites. This characteristic has been identified in pyramidal neurons in layers 2/3 and 5 of cortex [62, 60] and in the hippocampus [53]. In particular, the spiking neuron models used in Project 2 were fit specifically to data from layer 5 cells. It is still not clear whether the learning rules presented in these projects are compatible with pyramidal neurons in other cortical layers or other brain regions. Additionally, aside from convolutional layers, the layers in the networks presented in this thesis are all fully connected and remain so during learning. Future work could investigate learning in networks with sparse activity and connectivity constraints, as well as incorporating synaptic pruning into learning rules, to more closely resemble what has been observed in the mammalian brain. Finally, another gap in the models presented here is that they do not explore the roles of cortical layers, which feature quite distinct feedforward, feedback and recurrent connectivity patterns. Exploring the potential roles of different neuron types and cortical layers in performing gradient descent in hierarchical networks, and accounting for variations in shape, physiology and connectivity within neuron types, is another avenue for future research.

6.4 Concluding remarks

Over the past decade, gradient descent in deep neural networks has revolutionized the field of machine learning, and enabled machine learning models to reach human-level performance in a variety of learning tasks and modalities. Gradient descent is the most powerful algorithm for learning that we know of, and the only machine learning algorithm shown to be capable of learning a variety of complex tasks that mammals excel at. Recent studies showing similarities between feature representations in the mammalian brain and those learned via gradient descent [34, 35, 32, 36] provide experimental evidence for the theory that the brain performs gradient descent, and have motivated computational theories of gradient descent in the brain. How the brain might accomplish gradient descent remains an open question in neuroscience. This work attempts to address this question by presenting novel theories for how the unique properties of cortical neurons can enable gradient descent in hierarchical networks. Each of the projects presented here focuses on a set of biological considerations that must be accounted for in any fully biologically plausible algorithm for gradient descent. While each of the models presented here still contains its own set of biological implausibilities, the aim of this work as a whole is to provide solid foundations that can be built upon in future theoretical work, and to generate predictions about neural physiology and circuitry that can motivate directions for future experimental work. Investigating how the brain might implement gradient descent, given the unique computational constraints that emerge as a consequence of brain physiology, is an important step towards the ultimate goal of understanding learning in the brain.

Glossary

artificial neural network (ANN): A machine learning model inspired by biological neural networks that uses a collection of connected units, each applying a nonlinear transformation to a weighted sum of its inputs to produce real-valued outputs.

backpropagation of error (backprop): An efficient method for stochastic gradient descent that takes advantage of the chain rule for partial derivatives.

calcium plateau potential (Ca2+ spike): A large action potential-generating depolarizing current caused by the activation of voltage-gated calcium channels at an initiation zone around the main bifurcation point of apical dendrites of pyramidal neurons, observed in layer 5 of cortex and in CA1 of hippocampus [66, 61, 60].

credit assignment problem: The problem of assigning proper credit, or blame, to each neuron in a multi-layer hierarchical network for its contribution to a particular behavioral output, in order to undergo plasticity that leads to an improvement in behavior.

feedback alignment: A gradient descent algorithm introduced by Lillicrap et al. [46] that uses fixed, random feedback weights, rather than symmetric weights. Feedback alignment refers to the effect of feedforward weights adapting to become aligned with feedback weights during training.

regression discontinuity design (RDD): An experimental design for estimating the causal effect of an intervention by comparing observations lying on either side of the intervention threshold.

stochastic gradient descent (SGD): The process of finding a local minimum of a differentiable function by taking steps proportional to the negative of the gradient of the function at the current point.

weight transport problem: The biological implausibility of backpropagation of error's requirement that every unit in a neural network receive a signal of all downstream weights in the network in order to perform gradient descent.

Bibliography

[1] Huibert D. Mansvelder, Matthijs B. Verhoog, and Natalia A. Goriounova. “Synaptic plasticity in human cortical circuits: cellular mechanisms of learning and memory in the human brain?” In: Current Opinion in Neurobiology 54 (2019), pp. 186–193.
[2] Yann Humeau and Daniel Choquet. “The next generation of approaches to investigate the link between synaptic plasticity and learning”. In: Nature Neuroscience 22.10 (2019), pp. 1536–1543.
[3] Robert C Malenka and Mark F Bear. “LTP and LTD: an embarrassment of riches”. In: Neuron 44.1 (2004), pp. 5–21.
[4] Karri Lamsa and Petrina Lau. “Long-term plasticity of hippocampal interneurons during in vivo memory processes”. In: Current Opinion in Neurobiology 54 (2019), pp. 20–27.
[5] Richard H. Roth et al. “Cortical synaptic AMPA receptor plasticity during motor learning”. In: Neuron 105.5 (2020), pp. 895–908.
[6] Tim V. P. Bliss and Graham L. Collingridge. “A synaptic model of memory: long-term potentiation in the hippocampus”. In: Nature 361.6407 (1993), pp. 31–39.
[7] Pieter R. Roelfsema and Anthony Holtmaat. “Control of synaptic plasticity in deep cortical networks”. In: Nature Reviews Neuroscience 19.3 (2018), p. 166.
[8] Tim V.P. Bliss and Terje Lømo. “Long-lasting potentiation of synaptic transmission in the dentate area of anaesthetized rabbit following stimulation of the perforant path”. In: Journal of Physiology 232 (1973), pp. 351–356.
[9] C. Lüscher and R. C. Malenka. “NMDA Receptor-Dependent Long-Term Potentiation and Long-Term Depression (LTP/LTD)”. In: Cold Spring Harbor Perspectives in Biology 4.6 (June 2012), a005710. doi: 10.1101/cshperspect.a005710.
[10] Skyler L. Jackman and Wade G. Regehr. “The mechanisms and functions of synaptic facilitation”. In: Neuron 94.3 (2017), pp. 447–464.
[11] Zuzanna Brzosko, Susanna B. Mierau, and Ole Paulsen. “Neuromodulation of Spike-Timing-Dependent plasticity: past, present, and future”. In: Neuron 103.4 (2019), pp. 563–581.
[12] Qiang Gu. “Neuromodulatory transmitter systems in the cortex and their role in cortical plasticity”. In: Neuroscience 111.4 (2002), pp. 815–835.
[13] Klaus-Peter Lesch and Jonas Waider. “Serotonin in the modulation of neural plasticity and networks: implications for neurodevelopmental disorders”. In: Neuron 76.1 (2012), pp. 175–191.
[14] Robert C. Froemke. “Plasticity of cortical excitatory-inhibitory balance”. In: Annual Review of Neuroscience 38 (2015), pp. 195–219.


[15] L Abbott and S Nelson. “Synaptic plasticity: taming the beast”. In: Nature Neuroscience (Jan. 2000).
[16] Natalia Caporale and Yang Dan. “Spike timing-dependent plasticity: a Hebbian learning rule”. In: Annual Review of Neuroscience 31 (2008), pp. 25–46.
[17] Johannes J. Letzkus, Björn M. Kampa, and Greg J. Stuart. “Learning rules for spike timing-dependent plasticity depend on dendritic synapse location”. In: Journal of Neuroscience 26.41 (2006), pp. 10420–10429.
[18] Yang Dan and Mu-Ming Poo. “Spike Timing-Dependent Plasticity: From Synapse to Perception”. In: Physiological Reviews 86.3 (2006), p. 1033.
[19] Karri P. Lamsa, Dimitri M. Kullmann, and Melanie A. Woodin. “Spike-timing dependent plasticity in inhibitory circuits”. In: Frontiers in Synaptic Neuroscience 2 (2010), p. 8.
[20] Klaus Obermayer, Terrence Sejnowski, and Gary G. Blasdel. “Neural pattern formation via a competitive Hebbian mechanism”. In: Behavioural Brain Research 66.1–2 (Jan. 1995), pp. 161–167. doi: 10.1016/0166-4328(94)00136-4.
[21] Wulfram Gerstner and J. Leo van Hemmen. “Associative memory in a network of ‘spiking’ neurons”. In: Network: Computation in Neural Systems 3.2 (Jan. 1992), pp. 139–164. doi: 10.1088/0954-898X_3_2_004.
[22] Erkki Oja. “Simplified neuron model as a principal component analyzer”. In: Journal of Mathematical Biology 15.3 (Nov. 1982), pp. 267–273. doi: 10.1007/BF00275687.
[23] Aapo Hyvärinen and Erkki Oja. “Independent component analysis by general nonlinear Hebbian-like learning rules”. In: Signal Processing 64.3 (Feb. 1998), pp. 301–313. doi: 10.1016/S0165-1684(97)00197-7.
[24] Teuvo Kohonen. “Physiological interpretation of the Self-Organizing Map algorithm”. In: Neural Networks 6.7 (Jan. 1993), pp. 895–905. doi: 10.1016/S0893-6080(09)80001-4.
[25] Adam H. Marblestone, Greg Wayne, and Konrad P. Körding. “Toward an integration of deep learning and neuroscience”. In: Frontiers in Computational Neuroscience 10 (2016), p. 94.
[26] Johanni Brea and Wulfram Gerstner. “Does computational neuroscience need new synaptic learning paradigms?” In: Current Opinion in Behavioral Sciences 11 (2016), pp. 61–66.
[27] Blake A. Richards et al. “A deep learning framework for neuroscience”. In: Nature Neuroscience 22.11 (2019), pp. 1761–1770.
[28] Justin Werfel, Xiaohui Xie, and H. Sebastian Seung. “Learning curves for stochastic gradient descent in linear feedforward networks”. In: Advances in Neural Information Processing Systems. 2004, pp. 1197–1204.
[29] Kaiming He et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026–1034.
[30] Tien Ho-Phuoc. CIFAR10 to Compare Visual Recognition Performance between Deep Neural Networks and Humans. 2019. arXiv: 1811.07270 [cs.CV].

[31] David Silver et al. “Mastering the game of go without human knowledge”. In: Nature 550.7676 (2017), pp. 354–359.
[32] Daniel L. K. Yamins and James J. DiCarlo. “Using goal-driven deep learning models to understand sensory cortex”. In: Nature Neuroscience 19.3 (2016), pp. 356–365.
[33] Jack Lindsey et al. A Unified Theory of Early Visual Representations from Retina to Cortex through Anatomically Constrained Deep CNNs. 2019. arXiv: 1901.00945 [q-bio.NC].
[34] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. “Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation”. In: PLoS Computational Biology 10.11 (Nov. 2014), e1003915. doi: 10.1371/journal.pcbi.1003915.
[35] Charles F. Cadieu et al. “Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition”. In: PLoS Computational Biology 10.12 (Dec. 2014), e1003963. doi: 10.1371/journal.pcbi.1003963.
[36] Jonas Kubilius, Stefania Bracci, and Hans P. Op de Beeck. “Deep Neural Networks as a Computational Model for Human Shape Sensitivity”. In: PLoS Computational Biology 12.4 (Apr. 2016), e1004896. doi: 10.1371/journal.pcbi.1004896.
[37] Yoshua Bengio et al. Towards Biologically Plausible Deep Learning. 2016. arXiv: 1502.04156 [cs.LG].
[38] Francis Crick. “The recent excitement about neural networks”. In: Nature 337.6203 (Jan. 1989), pp. 129–132. doi: 10.1038/337129a0.
[39] James C. R. Whittington and Rafal Bogacz. “An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity”. In: Neural Computation 29.5 (2017), pp. 1229–1262.
[40] Thomas Mesnard, Wulfram Gerstner, and Johanni Brea. Towards deep learning with spiking neurons in energy based models with contrastive Hebbian plasticity. 2016. arXiv: 1612.03214 [cs.LG].
[41] Benjamin Scellier and Yoshua Bengio. “Equilibrium propagation: Bridging the gap between energy-based models and backpropagation”. In: Frontiers in Computational Neuroscience 11 (2017), p. 24.
[42] Mohamed Akrout et al. “Deep learning without weight transport”. In: Advances in Neural Information Processing Systems. 2019, pp. 976–984.
[43] Jordan Guerguiev, Timothy P. Lillicrap, and Blake A. Richards. “Towards Deep Learning with Segregated Dendrites”. In: eLife 6 (2017), e22901.
[44] Alexandre Payeur, Jean-Claude Béïque, and Richard Naud. “Classes of dendritic information processing”. In: Current Opinion in Neurobiology 58 (2019), pp. 78–85.
[45] Jordan Guerguiev, Konrad P. Körding, and Blake A. Richards. Spike-based causal inference for weight alignment. 2020. arXiv: 1910.01689 [q-bio.NC].

[46] Timothy P Lillicrap et al. “Random synaptic feedback weights support error backpropagation for deep learning”. In: Nature Communications 7.1 (2016), pp. 1–10.
[47] Olivia K. Swanson and Arianna Maffei. “From hiring to firing: activation of inhibitory neurons and their recruitment in behavior”. In: Frontiers in Molecular Neuroscience 12 (2019), p. 168.
[48] Iryna Yavorska and Michael Wehr. “Somatostatin-expressing inhibitory interneurons in cortical circuits”. In: Frontiers in Neural Circuits 10 (2016), p. 76.
[49] Mahesh M. Karnani et al. “Opening Holes in the Blanket of Inhibition: Localized Lateral Disinhibition by VIP Interneurons”. In: The Journal of Neuroscience 36.12 (Mar. 2016), pp. 3471–3480. doi: 10.1523/JNEUROSCI.3646-15.2016.
[50] Bernardo Rudy et al. “Three groups of interneurons account for nearly 100% of neocortical GABAergic neurons”. In: Developmental Neurobiology 71.1 (2011), pp. 45–61.
[51] Ryoma Hattori et al. “Functions and dysfunctions of neocortical inhibitory neuron subtypes”. In: Nature Neuroscience 20.9 (2017), p. 1199.
[52] Katherine C. Wood, Jennifer M. Blackwell, and Maria Neimark Geffen. “Cortical inhibitory interneurons control sensory processing”. In: Current Opinion in Neurobiology 46 (2017), pp. 200–207.
[53] Nelson Spruston. “Pyramidal neurons: dendritic structure and synaptic integration”. In: Nature Reviews Neuroscience 9.3 (Mar. 2008), pp. 206–221. doi: 10.1038/nrn2286.
[54] Charles R. Gerfen, Michael N. Economo, and Jayaram Chandrashekar. “Long distance projections of cortical pyramidal neurons”. In: Journal of Neuroscience Research 96.9 (2018), pp. 1467–1475. doi: 10.1002/jnr.23978.
[55] Amir Zarrinpar and Edward M Callaway. “Local connections to specific types of layer 6 neurons in the rat visual cortex”. In: Journal of Neurophysiology 95.3 (2006), pp. 1751–1761.
[56] Juhyun Kim et al. “Layer 6 corticothalamic neurons activate a cortical output layer, layer 5a”. In: Journal of Neuroscience 34.29 (2014), pp. 9656–9664.
[57] Ying-Wan Lam and S Murray Sherman. “Functional organization of the somatosensory cortical layer 6 feedback to the thalamus”. In: Cerebral Cortex 20.1 (2010), pp. 13–24.
[58] Edward Zagha et al. “Motor cortex feedback influences sensory processing by modulating network state”. In: Neuron 79.3 (2013), pp. 567–578.
[59] Pierre Veinante and Martin Deschênes. “Single-cell study of motor cortex projections to the barrel field in rats”. In: Journal of Comparative Neurology 464.1 (2003), pp. 98–103.
[60] Matthew E. Larkum, J. Julius Zhu, and Bert Sakmann. “A new cellular mechanism for coupling inputs arriving at different cortical layers”. In: Nature 398.6725 (Mar. 1999), pp. 338–341. doi: 10.1038/18686.
[61] Matthew E. Larkum et al. “Synaptic integration in tuft dendrites of layer 5 pyramidal neurons: a new unifying principle”. In: Science (Jan. 2009).

[62] Jack Waters et al. “Supralinear Ca2+ influx into dendritic tufts of layer 2/3 neocortical pyramidal neurons in vitro and in vivo”. In: Journal of Neuroscience 23.24 (2003), pp. 8558–8567.
[63] Matthew Larkum. “A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex”. In: Trends in Neurosciences 36.3 (Mar. 2013), pp. 141–151. doi: 10.1016/j.tins.2012.11.006.
[64] Satoshi Manita et al. “A Top-Down Cortical Circuit for Accurate Sensory Perception”. In: Neuron 86.5 (2015), pp. 1304–1316. doi: 10.1016/j.neuron.2015.05.006.
[65] Matthew E. Larkum, Walter Senn, and Hans-R. Lüscher. “Top-down dendritic input increases the gain of layer 5 pyramidal neurons”. In: Cerebral Cortex 14.10 (2004), pp. 1059–1070.
[66] Adam S. Shai et al. “Physiology of layer 5 pyramidal neurons in mouse primary visual cortex: coincidence detection through bursting”. In: PLoS Computational Biology 11.3 (2015), e1004090.
[67] Björn M. Kampa, Johannes J. Letzkus, and Greg J. Stuart. “Requirement of dendritic calcium spikes for induction of spike-timing-dependent synaptic plasticity: Dendritic spikes controlling STDP”. In: The Journal of Physiology 574.1 (July 2006), pp. 283–290. doi: 10.1113/jphysiol.2006.111062.
[68] Ole Paulsen and Terrence J. Sejnowski. “Natural patterns of activity and long-term synaptic plasticity”. In: Current Opinion in Neurobiology 10.2 (2000), pp. 172–180.
[69] Frédéric Gambino et al. “Sensory-evoked LTP driven by dendritic plateau potentials in vivo”. In: Nature 515.7525 (2014), pp. 116–119.
[70] Robert S. Zucker and Wade G. Regehr. “Short-term synaptic plasticity”. In: Annual Review of Physiology 64.1 (2002), pp. 355–405.
[71] Wade G Regehr. “Short-term presynaptic plasticity”. In: Cold Spring Harbor Perspectives in Biology 4.7 (2012), a005702.
[72] John Lisman. “Glutamatergic synapses are structurally and biochemically complex because of multiple plasticity processes: long-term potentiation, long-term depression, short-term potentiation and scaling”. In: Philosophical Transactions of the Royal Society B: Biological Sciences 372.1715 (2017), p. 20160260.
[73] Kaiwen He et al. “Distinct eligibility traces for LTP and LTD in cortical synapses”. In: Neuron 88.3 (2015), pp. 528–538.
[74] Mark F. Bear and Robert C. Malenka. “Synaptic plasticity: LTP and LTD”. In: Current Opinion in Neurobiology 4 (1994), pp. 389–399.
[75] Pablo E. Castillo, Chiayu Q. Chiu, and Reed C. Carroll. “Long-term plasticity at inhibitory synapses”. In: Current Opinion in Neurobiology 21.2 (Apr. 2011), pp. 328–338. doi: 10.1016/j.conb.2011.01.006.
[76] Dimitri M. Kullmann and Karri P. Lamsa. “LTP and LTD in cortical GABAergic interneurons: Emerging rules and roles”. In: Neuropharmacology 60.5 (Apr. 2011), pp. 712–719. doi: 10.1016/j.neuropharm.2010.12.020.

[77] Nicolas Frémaux and Wulfram Gerstner. “Neuromodulated Spike-Timing-Dependent Plasticity, and Theory of Three-Factor Learning Rules”. In: Frontiers in Neural Circuits 9 (Jan. 2016). doi: 10.3389/fncir.2015.00085.
[78] Wolfram Schultz. “Updating dopamine reward signals”. In: Current Opinion in Neurobiology 23.2 (2013), pp. 229–238. doi: 10.1016/j.conb.2012.11.012.
[79] Verena Pawlak. “Timing is not everything: neuromodulation opens the STDP gate”. In: Frontiers in Synaptic Neuroscience 2 (2010). doi: 10.3389/fnsyn.2010.00146.
[80] Sho Yagishita et al. “A critical time window for dopamine actions on the structural plasticity of dendritic spines”. In: Science 345.6204 (2014), pp. 1616–1620. doi: 10.1126/science.1255514.
[81] Zuzanna Brzosko, Wolfram Schultz, and Ole Paulsen. “Retroactive modulation of spike timing-dependent plasticity by dopamine”. In: eLife 4 (Oct. 2015), e09685. doi: 10.7554/eLife.09685.
[82] Jean Harb and Doina Precup. Investigating Recurrence and Eligibility Traces in Deep Q-Networks. 2017. arXiv: 1704.05495 [cs.AI].
[83] Per Jesper Sjöström, Gina G. Turrigiano, and Sacha B. Nelson. “Rate, Timing, and Cooperativity Jointly Determine Cortical Synaptic Plasticity”. In: Neuron 32.6 (2001), pp. 1149–1164. doi: 10.1016/S0896-6273(01)00542-6.
[84] Tim P. Vogels et al. “Inhibitory synaptic plasticity: spike timing-dependence and putative network function”. In: Frontiers in Neural Circuits 7 (2013), p. 119.
[85] Y. Dan and M.-M. Poo. “Spike timing-dependent plasticity of neural circuits”. In: Neuron 44.1 (2004), pp. 23–30. doi: 10.1016/j.neuron.2004.09.007.
[86] Per Jesper Sjöström, Gina G Turrigiano, and Sacha B Nelson. “Neocortical LTD via coincident activation of presynaptic NMDA and cannabinoid receptors”. In: Neuron 39.4 (2003), pp. 641–654.
[87] Melanie A. Woodin, Karunesh Ganguly, and Mu-ming Poo. “Coincident pre- and postsynaptic activity modifies GABAergic synapses by postsynaptic changes in Cl- transporter activity”. In: Neuron 39.5 (2003), pp. 807–820.
[88] Trevor Balena, Brooke A. Acton, and Melanie A. Woodin. “GABAergic synaptic transmission regulates calcium influx during spike-timing dependent plasticity”. In: Frontiers in Synaptic Neuroscience 2 (2010), p. 16.
[89] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.

[90] Joseph Redmon et al. “You only look once: Unified, real-time object detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[91] Ke Sun et al. “Deep high-resolution representation learning for human pose estimation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 5693–5703.
[92] Ali Bou Nassif et al. “Speech recognition using deep neural networks: A systematic review”. In: IEEE Access 7 (2019), pp. 19143–19165.
[93] Jeremy Howard and Sebastian Ruder. Universal Language Model Fine-tuning for Text Classification. 2018. arXiv: 1801.06146 [cs.CL].
[94] Guillaume Lample et al. Unsupervised Machine Translation Using Monolingual Corpora Only. 2018. arXiv: 1711.00043 [cs.CL].
[95] Yinhan Liu et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019. arXiv: 1907.11692 [cs.CL].
[96] Qing Rao and Jelena Frtunikj. “Deep learning for self-driving cars: chances and challenges”. In: Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems. 2018, pp. 35–38.
[97] Nicolas Heess et al. Emergence of Locomotion Behaviours in Rich Environments. 2017. arXiv: 1707.02286 [cs.AI].
[98] Sho Sonoda and Noboru Murata. “Neural network with unbounded activation functions is universal approximator”. In: Applied and Computational Harmonic Analysis 43.2 (2017), pp. 233–268.
[99] Andrew T. Smith et al. “Estimating receptive field size from fMRI data in human striate and extrastriate visual cortex”. In: Cerebral Cortex 11.12 (2001), pp. 1182–1190.
[100] Sergey Bartunov et al. “Assessing the scalability of biologically-motivated deep learning algorithms and architectures”. In: Advances in Neural Information Processing Systems. 2018, pp. 9368–9378.
[101] Yali Amit. “Deep learning with asymmetric connections and Hebbian updates”. In: Frontiers in Computational Neuroscience 13 (2019), p. 18.
[102] Theodore H. Moskovitz, Ashok Litwin-Kumar, and L. F. Abbott. Feedback alignment in deep convolutional networks. 2019. arXiv: 1812.06488 [cs.NE].
[103] Will Xiao et al. Biologically-plausible learning algorithms can scale to large datasets. 2018. arXiv: 1811.03567 [cs.LG].
[104] Qianli Liao, Joel Z. Leibo, and Tomaso Poggio. “How important is weight symmetry in backpropagation?” In: Thirtieth AAAI Conference on Artificial Intelligence. 2016.
[105] Daniel Kunin et al. Loss Landscapes of Regularized Linear Autoencoders. 2019. arXiv: 1901.08168 [cs.LG].
[106] John F. Kolen and Jordan B. Pollack. “Backpropagation without weight transport”. In: Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94). Vol. 3. IEEE. 1994, pp. 1375–1380.

[107] Georgios Detorakis, Travis Bartley, and Emre Neftci. “Contrastive Hebbian learning with random feedback weights”. In: Neural Networks 114 (2019), pp. 1–14.
[108] Xiaohui Xie and H. Sebastian Seung. “Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network”. In: Neural Computation 15.2 (Feb. 2003), pp. 441–454. doi: 10.1162/089976603762552988.
[109] Axel Laborieux et al. Scaling Equilibrium Propagation to Deep ConvNets by Drastically Reducing its Gradient Estimator Bias. 2020. arXiv: 2006.03824 [cs.NE].
[110] Dong-Hyun Lee et al. “Difference target propagation”. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2015, pp. 498–515.
[111] Robert Urbanczik and Walter Senn. “Learning by the Dendritic Prediction of Somatic Spiking”. In: Neuron 81.3 (Feb. 2014), pp. 521–528. doi: 10.1016/j.neuron.2013.11.030.
[112] Claudia Clopath and Wulfram Gerstner. “Voltage and spike timing interact in STDP – a unified model”. In: Frontiers in Synaptic Neuroscience 2 (2010), p. 25.
[113] Claudia Clopath et al. “Connectivity reflects coding: a model of voltage-based STDP with homeostasis”. In: Nature Neuroscience 13.3 (2010), pp. 344–52.
[114] João Sacramento et al. “Dendritic cortical microcircuits approximate the backpropagation algorithm”. In: Advances in Neural Information Processing Systems. 2018, pp. 8721–8732.
[115] Richard Naud and Henning Sprekeler. “Sparse bursts optimize information transmission in a multiplexed neural code”. In: Proceedings of the National Academy of Sciences 115.27 (2018), E6329–E6338.
[116] Benjamin James Lansdell and Konrad Paul Körding. “Spiking allows neurons to estimate their causal effect”. In: bioRxiv (2019), p. 253351.
[117] Yoshua Bengio and Yann LeCun. “Scaling learning algorithms towards AI”. In: Large-scale Kernel Machines 34.5 (2007), pp. 1–41.
[118] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553 (May 2015), pp. 436–444. doi: 10.1038/nature14539.
[119] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (Feb. 2015), pp. 529–533. doi: 10.1038/nature14236.
[120] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
[121] Kaiming He et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026–1034.

[122] Adam H. Marblestone, Greg Wayne, and Konrad P. Körding. “Toward an integration of deep learning and neuroscience”. In: Frontiers in Computational Neuroscience 10 (2016), p. 94. doi: 10.3389/fncom.2016.00094.
[123] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Nature 323.6088 (1986), pp. 533–536.
[124] Stephen J. Martin, Paul D. Grimwood, and Richard G. M. Morris. “Synaptic plasticity and memory: an evaluation of the hypothesis”. In: Annual Review of Neuroscience 23.1 (Mar. 2000), pp. 649–711. doi: 10.1146/annurev.neuro.23.1.649.
[125] Joel Zylberberg, Jason Timothy Murphy, and Michael Robert DeWeese. “A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields”. In: PLOS Computational Biology 7.10 (Oct. 2011), e1002250. doi: 10.1371/journal.pcbi.1002250.
[126] Joel Z. Leibo et al. “View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation”. In: Current Biology 27.1 (Jan. 2017), pp. 62–67. doi: 10.1016/j.cub.2016.10.015.
[127] Stephen Grossberg. “Competitive learning: From interactive activation to adaptive resonance”. In: Cognitive Science 11.1 (1987), pp. 23–63.
[128] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. “A fast learning algorithm for deep belief nets”. In: Neural Computation 18.7 (May 2006), pp. 1527–1554. doi: 10.1162/neco.2006.18.7.1527.
[129] Kenneth D. Harris. “Stability of the fittest: organizing learning through retroaxonal signals”. In: Trends in Neurosciences 31.3 (2008), pp. 130–136. doi: 10.1016/j.tins.2007.12.002.
[130] Robert Urbanczik and Walter Senn. “Reinforcement learning in populations of spiking neurons”. In: Nature Neuroscience 12.3 (2009), p. 250.
[131] Julian M. L. Budd. “Extrastriate feedback to primary visual cortex in primates: a quantitative analysis of connectivity”. In: Proceedings of the Royal Society of London B: Biological Sciences 265.1400 (June 1998), pp. 1037–1044. doi: 10.1098/rspb.1998.0396.
[132] Michael W. Spratling. “Cortical region interactions and the functional role of apical dendrites”. In: Behavioral and Cognitive Neuroscience Reviews 1.3 (Sept. 2002), pp. 219–228. doi: 10.1177/1534582302001003003.
[133] Matthew Evan Larkum et al. “Dendritic Spikes in Apical Dendrites of Neocortical Layer 2/3 Pyramidal Neurons”. In: The Journal of Neuroscience 27.34 (Aug. 2007), pp. 8999–9008. doi: 10.1523/JNEUROSCI.1717-07.2007.

[134] Konrad P. Körding and Peter König. “Supervised and unsupervised learning with two sites of synaptic integration”. In: Journal of Computational Neuroscience 11.3 (2001), pp. 207–215.
[135] Michael W. Spratling and Mark H. Johnson. “A feedback model of perceptual learning and categorization”. In: Visual Cognition 13.2 (Jan. 2006), pp. 129–165. doi: 10.1080/13506280500168562.
[136] Katie C Bittner et al. “Conjunctive input processing drives feature selectivity in hippocampal CA1 neurons”. In: Nature Neuroscience 18.8 (Aug. 2015), pp. 1133–1142. doi: 10.1038/nn.4062.
[137] Katie C. Bittner et al. “Behavioral time scale synaptic plasticity underlies CA1 place fields”. In: Science 357.6355 (2017), pp. 1033–1036.
[138] Vikram Gadagkar et al. “Dopamine neurons encode performance error in singing birds”. In: Science 354.6317 (Dec. 2016), p. 1278. doi: 10.1126/science.aah6837.
[139] Masanori Murayama et al. “Dendritic encoding of sensory stimuli controlled by deep cortical interneurons”. In: Nature 457.7233 (Feb. 2009), pp. 1137–1141. doi: 10.1038/nature07663.
[140] Markus M. Hilscher et al. “Chrna2-Martinotti Cells Synchronize Layer 5 Type A Pyramidal Cells via Rebound Excitation”. In: PLOS Biology 15.2 (Feb. 2017), pp. 1–26. doi: 10.1371/journal.pbio.2001392.
[141] Hyun-Jae Pi et al. “Cortical interneurons that specialize in disinhibitory control”. In: Nature 503.7477 (Nov. 2013), pp. 521–524. doi: 10.1038/nature12676.
[142] Carsten K. Pfeffer et al. “Inhibition of inhibition in visual cortex: the logic of connections between molecularly distinct interneurons”. In: Nature Neuroscience 16.8 (Aug. 2013), pp. 1068–1076. doi: 10.1038/nn.3446.
[143] Balázs Hangya et al. “Central Cholinergic Neurons Are Rapidly Recruited by Reinforcement Feedback”. In: Cell 162.5 (2015), pp. 1155–1168. doi: 10.1016/j.cell.2015.07.057.
[144] Arne Brombas, Lee N. Fletcher, and Stephen R. Williams. “Activity-dependent modulation of layer 1 inhibitory neocortical circuits by acetylcholine”. In: The Journal of Neuroscience 34.5 (Jan. 2014), p. 1932. doi: 10.1523/JNEUROSCI.4470-13.2014.
[145] György Buzsáki and Andreas Draguhn. “Neuronal oscillations in cortical networks”. In: Science 304.5679 (June 2004), pp. 1926–1929. doi: 10.1126/science.1099745.
[146] Laurens van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE”. In: Journal of Machine Learning Research 9.Nov (2008), pp. 2579–2605.
[147] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.

[148] David Daniel Cox and Thomas Dean. “Neural Networks and Neuroscience-Inspired Computer Vision”. In: Current Biology 24.18 (Sept. 2014), R921–R929. doi: 10.1016/j.cub.2014.08.026.
[149] Charles D. Gilbert and Wu Li. “Top-down influences on visual processing”. In: Nature Reviews Neuroscience 14.5 (May 2013), pp. 350–363. doi: 10.1038/nrn3476.
[150] Siyu Zhang et al. “Long-range and local circuits for top-down modulation of visual cortex processing”. In: Science 345.6197 (Aug. 2014), pp. 660–665. doi: 10.1126/science.1254126.
[151] Aris Fiser et al. “Experience-dependent spatial expectations in mouse visual cortex”. In: Nature Neuroscience 19.12 (Dec. 2016), pp. 1658–1664. doi: 10.1038/nn.4385.
[152] Marcus Leinweber et al. “A sensorimotor circuit in mouse cortex for visual flow predictions”. In: Neuron 95.6 (2017), 1420–1432.e5. doi: 10.1016/j.neuron.2017.08.036.
[153] Naoya Takahashi et al. “Active cortical dendrites modulate perception”. In: Science 354.6319 (Dec. 2016), p. 1587. doi: 10.1126/science.aah6066.
[154] Andrew D. Thompson et al. “Cortical feedback regulates feedforward retinogeniculate refinement”. In: Neuron 91.5 (2016), pp. 1021–1033. doi: 10.1016/j.neuron.2016.07.040.
[155] Yoshiyuki Yamada et al. “Context- and output layer-dependent long-term ensemble plasticity in a sensory circuit”. In: Neuron 93.5 (2017), 1198–1212.e5. doi: 10.1016/j.neuron.2017.02.006.
[156] Julia Veit et al. “Cortical gamma band synchronization through somatostatin interneurons”. In: Nature Neuroscience 20.7 (July 2017), pp. 951–959. doi: 10.1038/nn.4562.
[157] Björn M. Kampa and Greg J. Stuart. “Calcium spikes in basal dendrites of layer 5 pyramidal neurons during action potential bursts”. In: Journal of Neuroscience 26.28 (2006), pp. 7424–32.
[158] Gilad Silberberg and Henry Markram. “Disynaptic inhibition between neocortical pyramidal cells mediated by Martinotti cells”. In: Neuron 53.5 (Mar. 2007), pp. 735–746. doi: 10.1016/j.neuron.2007.02.012.
[159] William Muñoz et al. “Layer-specific modulation of neocortical dendritic inhibition during active wakefulness”. In: Science 355.6328 (Mar. 2017), p. 954. doi: 10.1126/science.aag2599.
[160] Yann LeCun et al. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.

[161] Yang Li et al. “Very Deep Neural Network for Handwritten Digit Recognition”. In: International Conference on Intelligent Data Engineering and Automated Learning. Springer. 2016, pp. 174–182.
[162] Ilya Sutskever et al. “On the importance of initialization and momentum in deep learning”. In: International Conference on Machine Learning 28 (2013), pp. 1139–1147.
[163] Tijmen Tieleman and Geoffrey Hinton. “Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude”. In: COURSERA: Neural Networks for Machine Learning 4.2 (2012).
[164] Nitish Srivastava et al. “Dropout: A simple way to prevent neural networks from overfitting”. In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.
[165] Per Jesper Sjöström and Michael Häusser. “A cooperative switch determines the sign of synaptic plasticity in distal dendrites of neocortical pyramidal neurons”. In: Neuron 51.2 (July 2006), pp. 227–238. doi: 10.1016/j.neuron.2006.06.017.
[166] Kendra S. Burbank and Gabriel Kreiman. “Depression-biased reverse plasticity rule is required for stable learning at top-down connections”. In: PLoS Computational Biology 8.3 (2012), e1002393.
[167] Kendra S. Burbank. “Mirrored STDP implements autoencoder learning in a network of spiking neurons”. In: PLoS Computational Biology 11.12 (2015), e1004566.
[168] Tiberiu Teşileanu, Bence Ölveczky, and Vijay Balasubramanian. “Rules and mechanisms for efficient two-stage learning in neural circuits”. In: eLife 6 (Apr. 2017), e20944. doi: 10.7554/eLife.20944.
[169] Jordan Guerguiev. Segregated-Dendrite-Deep-Learning. Github. https://github.com/jordan-g/Segregated-Dendrite-Deep-Learning. 23f2c66. 2017.
[170] Chris Loken et al. “SciNet: Lessons Learned from Building a Power-efficient Top-20 System and Data Centre”. In: Journal of Physics: Conference Series 256.1 (2010), p. 012026.
[171] Donald O. Hebb. The Organization of Behavior. New York: Wiley, 1949.
[172] Alain Artola, S Bröcher, and Wolf Singer. “Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex”. In: Nature 347.6288 (1990), pp. 69–72.
[173] Henry Markram et al. “Regulation of Synaptic Efficacy by Coincidence of Postsynaptic APs and EPSPs”. In: Science 275 (1997), pp. 213–215.
[174] Daniel E. Feldman. “Timing-based LTP and LTD and vertical inputs to layer II/III pyramidal cells in rat barrel cortex”. In: Neuron 27 (2000), pp. 45–56.
[175] Per Jesper Sjöström and Michael Häusser. “A cooperative switch determines the sign of synaptic plasticity in distal dendrites of neocortical pyramidal neurons”. In: Neuron 51.2 (2006), pp. 227–238.
[176] Eugene M. Izhikevich. “Solving the Distal Reward Problem through Linkage of STDP and Dopamine Signaling”. In: Cerebral Cortex 17 (2007), pp. 2443–2452.

[177] Geun Hee Seol et al. “Neuromodulators control the polarity of spike-timing-dependent synaptic plasticity”. In: Neuron 55.6 (2007), pp. 919–929.
[178] Robert Legenstein, Dejan Pecevski, and Wolfgang Maass. “A Learning Theory for Reward-Modulated Spike-Timing-Dependent Plasticity with Application to Biofeedback”. In: PLOS Computational Biology 4 (2008), e1000180.
[179] Nicolas Frémaux, Henning Sprekeler, and Wulfram Gerstner. “Functional requirements for reward-modulated spike-timing-dependent plasticity”. In: Journal of Neuroscience 30.40 (2010), pp. 13326–13337.
[180] Johannes Friedrich and Máté Lengyel. “Goal-directed decision making with spiking neurons”. In: Journal of Neuroscience 36.5 (2016), pp. 1529–1546.
[181] H. Francis Song, Guangyu R. Yang, and Xiao-Jing Wang. “Reward-based training of recurrent neural networks for cognitive and value-based tasks”. In: eLife 6 (2017), e21492.
[182] Thomas Miconi. “Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks”. In: eLife 6 (2017), e20899.
[183] Johnatan Aljadeff et al. Cortical credit assignment by Hebbian, neuromodulatory and inhibitory plasticity. 2019. arXiv: 1911.00307 [q-bio.NC].
[184] Wulfram Gerstner et al. “Eligibility traces and plasticity on behavioral time scales: experimental support of neoHebbian three-factor learning rules”. In: Frontiers in Neural Circuits 12 (2018).
[185] Ronald J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine Learning 8.3-4 (1992), pp. 229–256.
[186] Guillaume Bellec et al. “A solution to the learning dilemma for recurrent networks of spiking neurons”. In: bioRxiv (2020). doi: 10.1101/738385.
[187] Timothy P. Lillicrap et al. “Backpropagation and the brain”. In: Nature Reviews Neuroscience (2020), pp. 1–12.
[188] David E. Rumelhart, James L. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, 1986.
[189] Arash Samadi, Timothy P. Lillicrap, and Douglas B. Tweed. “Deep learning with dynamic spiking neurons and fixed feedback weights”. In: Neural Computation 29.3 (2017), pp. 578–602.
[190] James C. R. Whittington and Rafal Bogacz. “Theories of error back-propagation in the brain”. In: Trends in Cognitive Sciences 23.3 (2019), pp. 235–250.
[191] Hesham Mostafa, Vishwajith Ramesh, and Gert Cauwenberghs. “Deep supervised learning using local errors”. In: Frontiers in Neuroscience 12 (2018), p. 608.
[192] Benjamin James Lansdell, Prashanth Ravi Prakash, and Konrad Paul Körding. Learning to solve the credit assignment problem. 2020. arXiv: 1906.00889 [q-bio.NC].
[193] Isabella Pozzi, Sander Bohté, and Pieter Roelfsema. A Biologically Plausible Learning Rule for Deep Learning in the Brain. 2019. arXiv: 1811.01768 [cs.NE].

[194] Alex Reyes et al. "Target-cell-specific facilitation and depression in neocortical circuits". In: Nature Neuroscience 1.4 (1998), pp. 279–285.
[195] Henry Markram, Yun Wang, and Misha Tsodyks. "Differential signaling via the same axon of neocortical pyramidal neurons". In: Proceedings of the National Academy of Sciences 95.9 (1998), pp. 5323–5328.
[196] Thomas Nevian and Bert Sakmann. "Spine Ca2+ Signaling in Spike-Timing-Dependent Plasticity". In: Journal of Neuroscience 26.43 (2006), pp. 11001–11013.
[197] Robert C. Froemke et al. "Contribution of individual spikes in burst-induced long-term synaptic modification". In: Journal of Neurophysiology 95.3 (2006), pp. 1620–1629.
[198] Curtis C. Bell et al. "Storage of a sensory pattern by anti-Hebbian synaptic plasticity in an electric fish". In: Proceedings of the National Academy of Sciences 90.10 (1993), pp. 4650–4654.
[199] Kieran Bol et al. "Frequency-tuned cerebellar channels and burst-induced LTD lead to the cancellation of redundant sensory inputs". In: Journal of Neuroscience 31.30 (2011), pp. 11028–11038.
[200] Salomon Z. Muller et al. "Continual learning in a multi-layer network of an electric fish". In: Cell 179.6 (2019), pp. 1382–1392.
[201] Guy Bouvier et al. "Burst-dependent bidirectional plasticity in the cerebellum is driven by presynaptic NMDA receptors". In: Cell Reports 15.1 (2016), pp. 104–116.
[202] Blake A. Richards and Timothy P. Lillicrap. "Dendritic solutions to the credit assignment problem". In: Current Opinion in Neurobiology 54 (2019), pp. 28–36.
[203] Federico Brandalise and Urs Gerber. "Mossy fiber-evoked subthreshold responses induce timing-dependent plasticity at hippocampal CA3 recurrent synapses". In: Proceedings of the National Academy of Sciences 111.11 (2014), pp. 4303–4308.
[204] Christoph Kayser et al. "Spike-phase coding boosts and stabilizes information carried by spatial and temporal spike patterns". In: Neuron 61.4 (2009), pp. 597–608.
[205] Thomas Akam and Dimitri M. Kullmann. "Oscillatory multiplexing of population codes for selective communication in the mammalian brain". In: Nature Reviews Neuroscience 15.2 (2014), p. 111.
[206] David J. Herzfeld et al. "Encoding of action by the Purkinje cells of the cerebellum". In: Nature 526.7573 (2015), p. 439.
[207] Masanori Murayama et al. "Dendritic encoding of sensory stimuli controlled by deep cortical interneurons". In: Nature 457.7233 (2009), pp. 1137–1141.
[208] Björn Granseth, Erik Ahlstrand, and Sivert Lindström. "Paired pulse facilitation of corticogeniculate EPSCs in the dorsal lateral geniculate nucleus of the rat investigated in vitro". In: Journal of Physiology 544.2 (2002), pp. 477–486.
[209] S. Murray Sherman. "Thalamocortical interactions". In: Current Opinion in Neurobiology 22.4 (2012), pp. 575–579.
[210] Wulfram Gerstner et al. "A neuronal learning rule for sub-millisecond temporal coding". In: Nature 383.6595 (1996), pp. 76–78.

[211] Guo-qiang Bi and Mu-ming Poo. "Synaptic Modifications in Cultured Hippocampal Neurons: Dependence on Spike Timing, Synaptic Strength, and Postsynaptic Cell Type". In: Journal of Neuroscience 18.24 (1998), pp. 10464–10472.
[212] Rhiannon M. Meredith, Anna M. Floyer-Lea, and Ole Paulsen. "Maturation of long-term potentiation induction rules in rodent hippocampus: role of GABAergic inhibition". In: Journal of Neuroscience 23.35 (2003), pp. 11142–11146.
[213] Yanis Inglebert et al. "Altered spike timing-dependent plasticity rules in physiological calcium". In: bioRxiv (2020). doi: 10.1101/2020.03.16.993675.
[214] Guy Doron et al. "Perirhinal input to neocortical layer 1 controls learning". In: bioRxiv (2019). doi: 10.1101/713883.
[215] C. P. J. De Kock and Bert Sakmann. "High frequency action potential bursts (>100 Hz) in L2/3 and L5B thick tufted neurons in anaesthetized and awake rat primary somatosensory cortex". In: Journal of Physiology 586.14 (2008), pp. 3353–3364.
[216] Friedemann Zenke, Wulfram Gerstner, and Surya Ganguli. "The temporal paradox of Hebbian learning and homeostatic plasticity". In: Current Opinion in Neurobiology 43 (2017), pp. 166–176.
[217] Tuomo Mäki-Marttunen et al. "A unified computational model for cortical post-synaptic plasticity". In: eLife 9 (2020).
[218] Jean-Pascal Pfister and Wulfram Gerstner. "Triplets of spikes in a model of spike timing-dependent plasticity". In: Journal of Neuroscience (Jan. 2006).
[219] Elie L. Bienenstock, Leon N. Cooper, and Paul W. Munro. "Theory for the Development of Neuron Selectivity: Orientation Specificity and Binocular Interaction in Visual Cortex". In: Journal of Neuroscience 2.1 (1982), pp. 32–48.
[220] Matthew E. Larkum and J. Julius Zhu. "Signaling of layer 1 and whisker-evoked Ca2+ and Na+ action potentials in distal and terminal dendrites of rat neocortical pyramidal neurons in vitro and in vivo". In: Journal of Neuroscience 22.16 (2002), pp. 6991–7005.
[221] Ning-long Xu et al. "Nonlinear dendritic integration of sensory and motor input during an active sensing task". In: Nature 492.7428 (2012), pp. 247–251.
[222] Lee N. Fletcher and Stephen R. Williams. "Neocortical Topology Governs the Dendritic Integrative Capacity of Layer 5 Pyramidal Neurons". In: Neuron 101.1 (2019), pp. 76–90.
[223] Larry Cauller. "Layer I of primary sensory neocortex: where top-down converges upon bottom-up". In: Behavioural Brain Research 71.1 (1995), pp. 163–170.
[224] Daniel J. Felleman and David C. Van Essen. "Distributed Hierarchical Processing in the Primate Cerebral Cortex". In: Cerebral Cortex 1.1 (1991), pp. 1–47.
[225] Yun Wang et al. "Anatomical, physiological, molecular and circuit properties of nest basket cells in the developing somatosensory cortex". In: Cerebral Cortex 12 (2002), pp. 395–410.
[226] Simon X. Chen et al. "Subtype-specific plasticity of inhibitory circuits in motor cortex during motor learning". In: Nature Neuroscience 18.8 (2015), pp. 1109–1115.

[227] Sabine Krabbe et al. "Adaptive disinhibitory gating by VIP interneurons permits associative learning". In: Nature Neuroscience (2019), pp. 1–10.
[228] Yan Yang and Stephen G. Lisberger. "Purkinje-cell plasticity and cerebellar motor learning are graded by complex-spike duration". In: Nature 510.7506 (2014), p. 529.
[229] Vinay Parikh et al. "Prefrontal acetylcholine release controls cue detection on multiple timescales". In: Neuron 56.1 (2007), pp. 141–154.
[230] Aleksey V. Zaitsev and Roger Anwyl. "Inhibition of the slow afterhyperpolarization restores the classical spike timing-dependent plasticity rule obeyed in layer 2/3 pyramidal cells of the prefrontal cortex". In: Journal of Neurophysiology 107.1 (2011), pp. 205–215.
[231] Alfonso Renart, Nicolas Brunel, and Xiao-Jing Wang. "Mean-field theory of irregularly spiking neuronal populations and working memory in recurrent cortical networks". In: Computational Neuroscience: A Comprehensive Approach (2004), pp. 431–490.
[232] Olivier D. Faugeras, Jonathan D. Touboul, and Bruno Cessac. "A constructive mean-field analysis of multi population neural networks with random synaptic weights and stochastic inputs". In: Frontiers in Computational Neuroscience 3 (2009), p. 1.
[233] Tilo Schwalger, Moritz Deger, and Wulfram Gerstner. "Towards a theory of cortical columns: From spiking neurons to interacting neural populations of finite size". In: PLoS Computational Biology 13.4 (2017), e1005507.
[234] Xin Wang et al. "Feedforward excitation and inhibition evoke dual modes of firing in the cat's visual thalamus during naturalistic viewing". In: Neuron 55.3 (2007), pp. 465–478.
[235] Scott F. Owen, Joshua D. Berke, and Anatol C. Kreitzer. "Fast-spiking interneurons supply feedforward control of bursting, calcium, and plasticity for efficient learning". In: Cell 172.4 (2018), pp. 683–695.
[236] Brent Doiron et al. "The mechanics of state-dependent neural correlations". In: Nature Neuroscience 19.3 (2016), p. 383.
[237] Maya Sandler, Yoav Shulman, and Jackie Schiller. "A novel form of local plasticity in tuft dendrites of neocortical somatosensory layer 5 pyramidal neurons". In: Neuron 90.5 (2016), pp. 1028–1042.
[238] Martin Boerlin, Christian K. Machens, and Sophie Denève. "Predictive coding of dynamical variables in balanced spiking networks". In: PLoS Computational Biology 9.11 (2013), e1003258.
[239] Pieter R. Roelfsema and Arjen van Ooyen. "Attention-gated reinforcement learning of internal representations for classification". In: Neural Computation 17.10 (2005), pp. 2176–2214.
[240] Grace W. Lindsay and Kenneth D. Miller. "How biological attention mechanisms improve task performance in a large-scale visual system model". In: eLife 7 (2018), e38105.
[241] Naoya Takahashi et al. "Active cortical dendrites modulate perception". In: Science 354.6319 (2016), pp. 1587–1590. doi: 10.1126/science.aah6066.
[242] Thilo Womelsdorf et al. "Burst firing synchronizes prefrontal and anterior cingulate cortex during attentional control". In: Current Biology 24.22 (2014), pp. 2613–2621.
[243] Leopoldo Petreanu et al. "The subcellular organization of neocortical excitatory connections". In: Nature 457.7233 (Feb. 2009), pp. 1142–1145.

[244] Si-Qiang Ren et al. "Precise Long-Range Microcircuit-to-Microcircuit Communication Connects the Frontal and Sensory Cortices in the Mammalian Brain". In: Neuron 104.2 (2019), 385–401.e3. doi: 10.1016/j.neuron.2019.06.028.
[245] Federico W. Grillo et al. "A distance-dependent distribution of presynaptic boutons tunes frequency-dependent dendritic integration". In: Neuron 99.2 (2018), pp. 275–282.
[246] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
[247] Tengda Han, Weidi Xie, and Andrew Zisserman. "Video representation learning by dense predictive coding". In: Proceedings of the IEEE International Conference on Computer Vision Workshops. 2019.
[248] Nace L. Golding, Nathan P. Staff, and Nelson Spruston. "Dendritic spikes as a mechanism for cooperative long-term potentiation". In: Nature 418.6895 (2002), pp. 326–331.
[249] Jacopo Bono and Claudia Clopath. "Modeling somatic and dendritic spike mediated plasticity at the single neuron and network level". In: Nature Communications 8.1 (2017), p. 706.
[250] Friedemann Zenke and Wulfram Gerstner. "Limits to high-speed simulations of spiking neural networks using general-purpose computers". In: Frontiers in Neuroinformatics 8 (2014), p. 76.
[251] Joseph Bastian and Jerry Nguyenkim. "Dendritic modulation of burst-like firing in sensory neurons". In: Journal of Neurophysiology 85.1 (2001), pp. 10–22.
[252] Robin Tremblay, Soohyun Lee, and Bernardo Rudy. "GABAergic interneurons in the neocortex: from cellular properties to circuits". In: Neuron 91.2 (2016), pp. 260–292.
[253] Maximiliano José Nigro, Yoshiko Hashikawa-Yamasaki, and Bernardo Rudy. "Diversity and Connectivity of Layer 5 Somatostatin-Expressing Interneurons in the Mouse Barrel Cortex". In: Journal of Neuroscience 38.7 (2018), pp. 1622–1633. doi: 10.1523/JNEUROSCI.2415-17.2017.
[254] Richard Naud et al. "Firing patterns in the adaptive exponential integrate-and-fire model". In: Biological Cybernetics 99 (2008), pp. 335–347.
[255] Adam M. Packer and Rafael Yuste. "Dense, unspecific connectivity of neocortical parvalbumin-positive interneurons: a canonical microcircuit for inhibition?" In: Journal of Neuroscience 31.37 (2011), pp. 13260–13271.
[256] Rui P. Costa, P. Jesper Sjöström, and Mark C. W. Van Rossum. "Probabilistic inference of short-term synaptic plasticity in neocortical microcircuits". In: Frontiers in Computational Neuroscience 7 (2013).
[257] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). Tech. rep. University of Toronto, 2009. url: http://www.cs.toronto.edu/~kriz/cifar.html.
[258] Jia Deng et al. "ImageNet: A Large-Scale Hierarchical Image Database". In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009.

[259] Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks". In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 249–256.
[260] Reza Shadmehr, Maurice A. Smith, and John W. Krakauer. "Error correction, sensory prediction, and adaptation in motor control". In: Annual Review of Neuroscience 33 (2010), pp. 89–108.
[261] Dhruva Venkita Raman, Adriana Perez Rotondo, and Timothy O'Leary. "Fundamental bounds on learning performance in neural circuits". In: Proceedings of the National Academy of Sciences 116.21 (2019), pp. 10537–10546.
[262] Joshua D. Angrist and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.
[263] Ioana E. Marinescu, Patrick N. Lawlor, and Konrad P. Körding. "Quasi-experimental causality in neuroscience and behavioural research". In: Nature Human Behaviour (2018), p. 1.
[264] Piergiorgio Strata and Robin Harvey. "Dale's principle". In: Brain Research Bulletin 50.5 (Nov. 1, 1999), pp. 349–350. doi: 10.1016/S0361-9230(99)00100-8.
[265] Nasir Ahmad, Luca Ambrogioni, and Marcel A. J. van Gerven. Overcoming the Weight Transport Problem via Spike-Timing-Dependent Weight Inference. 2020. arXiv: 2003.03988 [q-bio.NC].
[266] Jeehyun Kwag and Ole Paulsen. "The timing of external input controls the sign of plasticity at local synapses". In: Nature Neuroscience 12.10 (2009), pp. 1219–1221.
[267] Yanping Huang and Rajesh P. N. Rao. "Predictive coding". In: WIREs Cognitive Science 2.5 (2011), pp. 580–593. doi: 10.1002/wcs.142.
[268] João Sacramento et al. "Dendritic cortical microcircuits approximate the backpropagation algorithm". In: Advances in Neural Information Processing Systems. 2018, pp. 8735–8746.
[269] Arild Nøkland. "Direct feedback alignment provides learning in deep neural networks". In: Advances in Neural Information Processing Systems. 2016, pp. 1037–1045.
[270] R. Naud and W. Gerstner. "The Performance (and Limits) of Simple Neuron Models: Generalizations of the Leaky Integrate-and-Fire Model". In: Computational Systems Neurobiology. Springer, 2012.
[271] Srdjan Ostojic and Nicolas Brunel. "From spiking neuron models to linear-nonlinear models". In: PLoS Computational Biology 7.1 (2011), e1001056. doi: 10.1371/journal.pcbi.1001056.

Appendix A

Appendix for Project 1

A.1 Proofs

A.1.1 Theorem for loss function coordination

The targets that we selected for the hidden layer (see equation (3.8)) were based on the targets used in [110]. The authors of that paper provided a proof showing that their hidden layer targets guaranteed that learning in one layer helped reduce the error in the next layer. However, there were a number of differences between our network and theirs, such as the use of spiking neurons, voltages, different compartments, etc. Here, we modify the original [110] proof slightly to prove Theorem A.1. One important thing to note is that the theorem given here utilizes a target for the hidden layer that is slightly different than the one defined in equation (3.8). However, the target defined in equation (3.8) is a numerical approximation of the target given in Theorem A.1. After the proof, we describe exactly how these approximations relate to the targets given here.

Theorem A.1. Consider a neural network with one hidden layer and an output layer. Let $\tilde{\phi}^{0*} = \phi^{0f} + \sigma(Y\phi^{1t}) - \sigma(Y\phi_{\max}\sigma(\mathbb{E}[V^{1f}]))$ be the target firing rates for neurons in the hidden layer, where $\sigma(\cdot)$ is a differentiable function. Assume that $V^{1f} \approx k_d V^{1b,f}$. Let $\phi^{1*} = \phi^{1t}$ be the target firing rates for the output layer. Also, for notational simplicity, let $\beta(x) \equiv \phi_{\max}\sigma(k_d W^1 x)$ and $\gamma(x) \equiv \sigma(Y x)$. Theorem A.1 states that if $\phi^{1*} - \phi_{\max}\sigma(\mathbb{E}[V^{1f}])$ is sufficiently small, and the Jacobian matrices $J_\beta$ and $J_\gamma$ satisfy the condition that the largest eigenvalue of $(I - J_\beta J_\gamma)^T (I - J_\beta J_\gamma)$ is less than 1, then

$$\|\phi^{1*} - \phi_{\max}\sigma(k_d W^1 \tilde{\phi}^{0*})\|_2^2 < \|\phi^{1*} - \phi_{\max}\sigma(\mathbb{E}[V^{1f}])\|_2^2$$

We note again that the proof for this theorem is essentially a modification of the proof provided in [110] that incorporates our Lemma A.1 to take into account the expected value of $s^{0f}$, given that spikes in the network are generated with non-stationary Poisson processes.

Proof.

$$\phi^{1*} - \phi_{\max}\sigma(k_d W^1 \tilde{\phi}^{0*}) \equiv \phi^{1*} - \beta(\tilde{\phi}^{0*}) = \phi^{1*} - \beta\left(\phi^{0f} + \gamma(\phi^{1t}) - \gamma(\phi_{\max}\sigma(\mathbb{E}[V^{1f}]))\right)$$

Lemma A.1 shows that $\phi_{\max}\sigma(\mathbb{E}[V^{1f}]) = \phi_{\max}\sigma(\mathbb{E}[k_d W^1 s^{0f}]) \approx \phi_{\max}\sigma(k_d W^1 \phi^{0f})$ given a sufficiently large averaging time window. Assume that $\phi_{\max}\sigma(\mathbb{E}[V^{1f}]) = \phi_{\max}\sigma(k_d W^1 \phi^{0f}) \equiv \beta(\phi^{0f})$. Then,

$$\phi^{1*} - \beta(\tilde{\phi}^{0*}) = \phi^{1*} - \beta\left(\phi^{0f} + \gamma(\phi^{1t}) - \gamma(\beta(\phi^{0f}))\right)$$

Let $e = \phi^{1t} - \beta(\phi^{0f})$. Applying Taylor's theorem,

$$\phi^{1*} - \beta(\tilde{\phi}^{0*}) = \phi^{1*} - \beta\left(\phi^{0f} + J_\gamma e + o(\|e\|_2)\right)$$

where $o(\|e\|_2)$ is the remainder term that satisfies $\lim_{e\to 0} o(\|e\|_2)/\|e\|_2 = 0$. Applying Taylor's theorem again,

$$\begin{aligned}
\phi^{1*} - \beta(\tilde{\phi}^{0*}) &= \phi^{1*} - \beta(\phi^{0f}) - J_\beta\left(J_\gamma e + o(\|e\|_2)\right) - o\left(\|J_\gamma e + o(\|e\|_2)\|_2\right)\\
&= \phi^{1*} - \beta(\phi^{0f}) - J_\beta J_\gamma e - o(\|e\|_2)\\
&= (I - J_\beta J_\gamma)e - o(\|e\|_2)
\end{aligned}$$

Then,

$$\begin{aligned}
\|\phi^{1*} - \beta(\tilde{\phi}^{0*})\|_2^2 &= \left((I - J_\beta J_\gamma)e - o(\|e\|_2)\right)^T\left((I - J_\beta J_\gamma)e - o(\|e\|_2)\right)\\
&= e^T(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)e - o(\|e\|_2)^T(I - J_\beta J_\gamma)e\\
&\quad - e^T(I - J_\beta J_\gamma)^T o(\|e\|_2) + o(\|e\|_2)^T o(\|e\|_2)\\
&= e^T(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)e + o(\|e\|_2^2)\\
&\le \mu\|e\|_2^2 + |o(\|e\|_2^2)|
\end{aligned}$$

where $\mu$ is the largest eigenvalue of $(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)$. If $e$ is sufficiently small so that $|o(\|e\|_2^2)| < (1 - \mu)\|e\|_2^2$, then

$$\|\phi^{1*} - \phi_{\max}\sigma(k_d W^1 \tilde{\phi}^{0*})\|_2^2 < \|e\|_2^2 = \|\phi^{1*} - \phi_{\max}\sigma(\mathbb{E}[V^{1f}])\|_2^2$$

Note that the last step requires that $\mu$, the largest eigenvalue of $(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)$, is below 1. Clearly, we do not actually have any guarantee of meeting this condition. However, our results show that even though the feedback weights are random and fixed, the feedforward weights actually learn to meet this condition during the first epoch of training (Figure A.1).

A.1.2 Hidden layer targets

Theorem A.1 shows that if we use a target $\tilde{\phi}^{0*} = \phi^{0f} + \sigma(Y\phi^{1t}) - \sigma(Y\phi_{\max}\sigma(k_d W^1 \phi^{0f}))$ for the hidden layer, there is a guarantee that the hidden layer approaching this target will also push the upper layer closer to its target $\phi^{1*}$, if certain other conditions are met. Our specific choice of $\phi^{0*}$ defined in the Results (equation (3.8)) approximates this target rate vector using variables that are accessible to the hidden layer units. If neuronal units calculate averages after the network has reached a steady state and the firing rates of neurons are in the linear region of the sigmoid function, $\phi_{\max}\sigma(V^{1f}) \approx \phi^{1f}$. Using Lemma A.1, $\mathbb{E}[V^{1f}] \approx k_d W^1 \phi^{0f}$ and $\mathbb{E}[V^{0a,f}] \approx Y\phi^{1f}$. If we assume that $V^{1f} \approx \mathbb{E}[V^{1f}]$ and $V^{0a,f} \approx \mathbb{E}[V^{0a,f}]$, which is true on average, then:

$$\alpha^f = \sigma(V^{0a,f}) \approx \sigma(Y\phi^{1f}) \approx \sigma(Y\phi_{\max}\sigma(V^{1f})) \approx \sigma(Y\phi_{\max}\sigma(k_d W^1 \phi^{0f})) \quad (A.1)$$

and:

$$\alpha^t = \sigma(V^{0a,t}) \approx \sigma(Y\phi^{1t}) \quad (A.2)$$

Therefore, $\phi^{0*} \approx \tilde{\phi}^{0*}$. Thus, our hidden layer targets ensure that our model employs a learning rule similar to difference target propagation that approximates the necessary conditions to guarantee error convergence.

A.1.3 Lemma for firing rates

Theorem A.1 had to rely on the equivalence between the average spike rates of the neurons and their filtered spike trains. Here, we prove a lemma showing that this equivalence does indeed hold as long as the integration time is long enough relative to the synaptic time constants $\tau_s$ and $\tau_L$.

Lemma A.1. Let $X$ be a set of presynaptic spike times during the time interval $\Delta t = t_1 - t_0$, distributed according to an inhomogeneous Poisson process. Let $N = |X|$ denote the number of presynaptic spikes during this time window, and let $x_k \in X$ denote the $k$th presynaptic spike time, where $0 < k \le N$. Finally, let $\phi(t)$ denote the time-varying presynaptic firing rate (i.e. the time-varying mean of the Poisson process), with mean $\bar{\phi} = \frac{1}{\Delta t}\int_{t_0}^{t_1}\phi(t)\,dt$ and mean square $\overline{\phi^2} = \frac{1}{\Delta t}\int_{t_0}^{t_1}\phi(t)^2\,dt$ over the window, and let $s(t)$ be the filtered presynaptic spike train at time $t$ given by equation (3.11). Then, during the time window $\Delta t$, as long as $\Delta t \gg 2\tau_L^2\tau_s^2\,\overline{\phi^2}/\left[(\tau_L - \tau_s)^2(\tau_L + \tau_s)\right]$,

$$\mathbb{E}[\bar{s}] \approx \bar{\phi}$$

where $\bar{s}$ denotes the average of $s(t)$ over the window.

Proof. The average of $s(t)$ over the time window $\Delta t$ is

$$\bar{s} = \frac{1}{\Delta t}\int_{t_0}^{t_1} s(t)\,dt = \frac{1}{\Delta t}\sum_k \int_{t_0}^{t_1} \frac{e^{-(t - x_k)/\tau_L} - e^{-(t - x_k)/\tau_s}}{\tau_L - \tau_s}\,\Theta(t - x_k)\,dt$$

Since $\Theta(t - x_k) = 0$ for all $t < x_k$,

$$\bar{s} = \frac{1}{\Delta t}\sum_k \int_{x_k}^{t_1} \frac{e^{-(t - x_k)/\tau_L} - e^{-(t - x_k)/\tau_s}}{\tau_L - \tau_s}\,dt = \frac{1}{\Delta t}\left(N - \sum_k \frac{\tau_L e^{-(t_1 - x_k)/\tau_L} - \tau_s e^{-(t_1 - x_k)/\tau_s}}{\tau_L - \tau_s}\right)$$

The expected value of $\bar{s}$ with respect to $X$ is given by

$$\mathbb{E}_X[\bar{s}] = \frac{\mathbb{E}_X[N]}{\Delta t} - \frac{1}{\Delta t}\mathbb{E}_X\left[\sum_{k=1}^{N}\frac{\tau_L e^{-(t_1 - x_k)/\tau_L} - \tau_s e^{-(t_1 - x_k)/\tau_s}}{\tau_L - \tau_s}\right]$$

Since the presynaptic spikes are an inhomogeneous Poisson process with rate $\phi(t)$, $\mathbb{E}_X[N] = \int_{t_0}^{t_1}\phi(t)\,dt$. Thus,

$$\mathbb{E}_X[\bar{s}] = \bar{\phi} - \frac{1}{\Delta t}\mathbb{E}_X\left[\sum_{k=1}^{N} g(x_k)\right]$$

where we let $g(x_k) \equiv (\tau_L e^{-(t_1 - x_k)/\tau_L} - \tau_s e^{-(t_1 - x_k)/\tau_s})/(\tau_L - \tau_s)$. Then, the law of total expectation gives

$$\mathbb{E}_X\left[\sum_{k=1}^{N} g(x_k)\right] = \mathbb{E}_N\left[\mathbb{E}_X\left[\sum_{k=1}^{N} g(x_k)\,\middle|\, N\right]\right] = \sum_{n=0}^{\infty}\mathbb{E}_X\left[\sum_{k=1}^{N} g(x_k)\,\middle|\, N = n\right] P(N = n)$$

Letting $f_{x_k}(t)$ denote $P(x_k = t)$, we have that

$$\mathbb{E}_X\left[\sum_{k=1}^{N} g(x_k)\,\middle|\, N = n\right] = \sum_{k=1}^{n}\mathbb{E}_X[g(x_k)] = \sum_{k=1}^{n}\int_{t_0}^{t_1} g(t)f_{x_k}(t)\,dt$$

Since Poisson spike times are independent, for an inhomogeneous Poisson process:

$$f_{x_k}(t) = \frac{\phi(t)}{\int_{t_0}^{t_1}\phi(u)\,du} = \frac{\phi(t)}{\bar{\phi}\Delta t}$$

for all $t \in [t_0, t_1]$, and this is true for all $k$. Thus,

$$\mathbb{E}_X\left[\sum_{k=1}^{N} g(x_k)\,\middle|\, N = n\right] = \frac{1}{\bar{\phi}\Delta t}\sum_{k=1}^{n}\int_{t_0}^{t_1} g(t)\phi(t)\,dt = \frac{n}{\bar{\phi}\Delta t}\int_{t_0}^{t_1} g(t)\phi(t)\,dt$$

Then,

$$\mathbb{E}_X\left[\sum_{k=1}^{N} g(x_k)\right] = \frac{1}{\bar{\phi}\Delta t}\left(\int_{t_0}^{t_1} g(t)\phi(t)\,dt\right)\sum_{n=0}^{\infty} n\,P(N = n)$$

Now, for an inhomogeneous Poisson process with time-varying rate $\phi(t)$,

$$P(N = n) = \frac{\left[\int_{t_0}^{t_1}\phi(t)\,dt\right]^n e^{-\int_{t_0}^{t_1}\phi(t)\,dt}}{n!} = \frac{[\bar{\phi}\Delta t]^n e^{-\bar{\phi}\Delta t}}{n!}$$

Thus,

$$\mathbb{E}_X\left[\sum_{k=1}^{N} g(x_k)\right] = \frac{e^{-\bar{\phi}\Delta t}}{\bar{\phi}\Delta t}\left(\int_{t_0}^{t_1} g(t)\phi(t)\,dt\right)\sum_{n=0}^{\infty} n\frac{[\bar{\phi}\Delta t]^n}{n!} = \frac{e^{-\bar{\phi}\Delta t}}{\bar{\phi}\Delta t}\left(\int_{t_0}^{t_1} g(t)\phi(t)\,dt\right)(\bar{\phi}\Delta t)e^{\bar{\phi}\Delta t} = \int_{t_0}^{t_1} g(t)\phi(t)\,dt$$

Then,

$$\mathbb{E}_X[\bar{s}] = \bar{\phi} - \frac{1}{\Delta t}\int_{t_0}^{t_1} g(t)\phi(t)\,dt$$

The second term of this equation is always greater than or equal to 0, since $g(t) \ge 0$ and $\phi(t) \ge 0$ for all $t$. Thus, $\mathbb{E}_X[\bar{s}] \le \bar{\phi}$. As well, the Cauchy-Schwarz inequality states that

$$\int_{t_0}^{t_1} g(t)\phi(t)\,dt \le \sqrt{\int_{t_0}^{t_1} g(t)^2\,dt}\sqrt{\int_{t_0}^{t_1}\phi(t)^2\,dt} = \sqrt{\int_{t_0}^{t_1} g(t)^2\,dt}\sqrt{\overline{\phi^2}\Delta t}$$

where

$$\int_{t_0}^{t_1} g(t)^2\,dt = \int_{t_0}^{t_1}\left(\frac{\tau_L e^{-(t_1 - t)/\tau_L} - \tau_s e^{-(t_1 - t)/\tau_s}}{\tau_L - \tau_s}\right)^2 dt \le \frac{1}{2(\tau_L - \tau_s)^2}\left(\frac{4\tau_L^2\tau_s^2}{\tau_L + \tau_s}\right) = \frac{2\tau_L^2\tau_s^2}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)}$$

Thus,

$$\int_{t_0}^{t_1} g(t)\phi(t)\,dt \le \sqrt{\frac{2\tau_L^2\tau_s^2}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)}}\sqrt{\overline{\phi^2}\Delta t} = \sqrt{\Delta t}\sqrt{\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)}}$$

Therefore,

$$\mathbb{E}_X[\bar{s}] \ge \bar{\phi} - \frac{1}{\Delta t}\sqrt{\Delta t}\sqrt{\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)}} = \bar{\phi} - \sqrt{\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{\Delta t\,(\tau_L - \tau_s)^2(\tau_L + \tau_s)}}$$

Then,

$$\bar{\phi} - \sqrt{\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{\Delta t\,(\tau_L - \tau_s)^2(\tau_L + \tau_s)}} \le \mathbb{E}_X[\bar{s}] \le \bar{\phi}$$

Thus, as long as $\Delta t \gg 2\tau_L^2\tau_s^2\,\overline{\phi^2}/\left[(\tau_L - \tau_s)^2(\tau_L + \tau_s)\right]$, $\mathbb{E}_X[\bar{s}] \approx \bar{\phi}$.

What this lemma says, effectively, is that the expected value of $\bar{s}$ is going to be roughly the average presynaptic rate of fire, as long as the time over which the average is taken is sufficiently long in comparison to the postsynaptic time constants and the average rate of fire is sufficiently small. In our simulations, $\Delta t$ is always greater than or equal to 50 ms, the average rate of fire is approximately 20 Hz, and our time constants $\tau_L$ and $\tau_s$ are 10 ms and 3 ms, respectively. Hence, in general:

$$2\tau_L^2\tau_s^2\,\overline{\phi^2}/\left[(\tau_L - \tau_s)^2(\tau_L + \tau_s)\right] = 2(10)^2(3)^2(0.02)^2/\left[(10 - 3)^2(10 + 3)\right] \approx 0.001 \ll 50$$

Thus, in the proof of Theorem A.1, we assume $\mathbb{E}_X[\bar{s}] = \bar{\phi}$.
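To make the lemma concrete, the following is a minimal numerical check (a sketch we provide for illustration, not part of the original analysis): it assumes a homogeneous rate and includes a burn-in period so that the filter has reached steady state before averaging, and it uses the parameter values from Table A.1:

import numpy as np

rng = np.random.default_rng(0)

tau_L, tau_s = 10.0, 3.0        # ms, synaptic time constants (Table A.1)
phi = 0.02                      # spikes/ms (20 Hz), a typical average rate
dt = 0.05                       # ms, integration step for this check
burn_in, window = 200.0, 50.0   # ms; 50 ms is the smallest averaging window
n_trials = 1000

t = np.arange(0.0, burn_in + window, dt)
in_window = t >= burn_in

s_bar = np.empty(n_trials)
for i in range(n_trials):
    # Homogeneous Poisson spike train with rate phi.
    spike_times = t[rng.random(t.size) < phi * dt]
    # Filtered spike train s(t), double-exponential kernel of equation (3.11).
    s = np.zeros_like(t)
    for x in spike_times:
        u = t - x
        m = u >= 0.0
        s[m] += (np.exp(-u[m] / tau_L) - np.exp(-u[m] / tau_s)) / (tau_L - tau_s)
    s_bar[i] = s[in_window].mean()      # time average over the 50 ms window

print(f"phi             = {phi:.4f} spikes/ms")
print(f"E[s_bar] (est.) = {s_bar.mean():.4f}")   # approximately phi

Averaged over realizations, the time-averaged filtered spike train closely matches the presynaptic rate, as the lemma predicts.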

Parameter | Units | Value | Description
dt | ms | 1 | Time step resolution
$\phi_{\max}$ | Hz | 200 | Maximum spike rate
$\tau_s$ | ms | 3 | Short synaptic time constant
$\tau_L$ | ms | 10 | Long synaptic time constant
$\Delta t_s$ | ms | 30 | Settle duration for calculation of average voltages
$g_b$ | S | 0.6 | Hidden layer conductance from basal dendrites to the soma
$g_a$ | S | 0, 0.05, 0.6 | Hidden layer conductance from apical dendrites to the soma
$g_d$ | S | 0.6 | Output layer conductance from dendrites to the soma
$g_l$ | S | 0.1 | Leak conductance
$V_R$ | mV | 0 | Resting membrane potential
$C_m$ | F | 1 | Membrane capacitance
$P_0$ | – | $20/\phi_{\max}$ | Hidden layer error signal scaling factor
$P_1$ | – | $20/\phi_{\max}^2$ | Output layer error signal scaling factor

Table A.1: List of parameter values used in our simulations.

A.2 Supplemental figures

Figure A.1: Weight alignment during first epoch of training. (A) Plot of the maximum eigenvalue of $(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)$ over 60,000 training examples for a one hidden layer network, where $J_\beta$ and $J_\gamma$ are the mean feedforward and feedback Jacobian matrices for the last 100 training examples. The maximum eigenvalue of $(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)$ drops below 1 as learning progresses, satisfying the main condition for the learning guarantee described in Theorem A.1 to hold. (B) The product of the mean feedforward and feedback Jacobian matrices, $J_\beta J_\gamma$, for a one hidden layer network, before training (left) and after 1 epoch of training (right). As training progresses, the network updates its weights in a way that causes this product to approach the identity matrix, meaning that the two matrices are roughly inverses of each other.
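For readers who wish to monitor this condition in their own implementations, the following is a small sketch (ours, not taken from the original code repository) of the quantity plotted in panel A, given estimates of the two Jacobian matrices:

import numpy as np

def max_eigenvalue(J_beta, J_gamma):
    # Largest eigenvalue of (I - J_beta J_gamma)^T (I - J_beta J_gamma).
    I = np.eye(J_beta.shape[0])
    A = I - J_beta @ J_gamma
    # A^T A is symmetric positive semi-definite, so eigvalsh applies.
    return np.linalg.eigvalsh(A.T @ A).max()

rng = np.random.default_rng(1)
n = 100
J_beta = rng.normal(scale=0.1, size=(n, n))    # placeholder Jacobians
J_gamma = rng.normal(scale=0.1, size=(n, n))
print(max_eigenvalue(J_beta, J_gamma))   # below 1 satisfies Theorem A.1

In practice, $J_\beta$ and $J_\gamma$ would be the mean feedforward and feedback Jacobians estimated over recent training examples, as in the figure.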

Figure A.2: Learning with stochastic plateau times. (A) Left: Raster plot showing plateau potential times during presentation of two training examples for 100 neurons in the hidden layer of a network where plateau potential times were randomly sampled for each neuron from a folded normal distribution ($\mu = 0$, $\sigma^2 = 3$) that was truncated (max = 5) such that plateau potentials occurred between 0 ms and 5 ms before the start of the next phase. In this scenario, the apical potential over the last 30 ms was integrated to calculate the plateau potential for each neuron. (B) Plot of test error across 60 epochs of training on MNIST of a one hidden layer network, with synchronized plateau potentials (gray) and with stochastic plateau potentials (red). Allowing neurons to undergo plateau potentials in a stochastic manner did not hinder training performance.

Figure A.3: Importance of weight magnitudes for learning with sparse weights. Plot of test error across 20 epochs of training on MNIST of a one hidden layer network, with regular feedback weights (gray), sparse feedback weights that were amplified (red), and sparse feedback weights that were not amplified (blue). The network with amplified sparse feedback weights is the same as in Figure 3.8A & B, where feedback weights were multiplied by a factor of 5. While sparse feedback weights that were amplified led to improved training performance, sparse weights without amplification impaired the network's learning ability. Right: Spreads (min – max) of the results of repeated weight tests (n = 20) after 20 epochs for each of the networks. Percentages indicate means (two-tailed t-test, regular vs. sparse, amplified: $t_{38} = 44.96$, $P_{38} = 4.4 \times 10^{-34}$; regular vs. sparse, not amplified: $t_{38} = -51.30$, $P_{38} = 3.2 \times 10^{-36}$; sparse, amplified vs. sparse, not amplified: $t_{38} = -100.73$, $P_{38} = 2.8 \times 10^{-47}$; Bonferroni correction for multiple comparisons).

Appendix B

Figure A.3: Importance of weight magnitudes for learning with sparse weights. Plot of test error across 20 epochs of training on MNIST of a one hidden layer network, with regular feedback weights (gray), sparse feedback weights that were amplified (red), and sparse feedback weights that were not amplified (blue). The network with amplified sparse feedback weights is the same as in Figure 3.8A & B, where feedback weights were multiplied by a factor of 5. While sparse feedback weights that were amplified led to improved training performance, sparse weights without amplification impaired the network’s learning ability. Right: Spreads (min – max) of the results of repeated weight tests (n = 20) after 20 epochs for each of the networks. Percentages indicate means (two-tailed t-test, regular vs. −34 sparse, amplified: t38 = 44.96, P38 = 4.4 × 10 ; regular vs. sparse, not amplified: t38 = −51.30, −36 −47 P38 = 3.2 × 10 ; sparse, amplified vs. sparse, not amplified: t38 = −100.73, P38 = 2.8 × 10 , Bonferroni correction for multiple comparisons). Appendix B

Appendix for Project 2

In this supplementary text, we explore the link between the standard backpropagation algorithm (backprop) and plasticity in a network with burst-dependent learning rules (burstprop). As a counterpoint to the main text, which proceeded bottom-up (from a spike-based to a rate-based description), this supplementary text follows a top-down approach. We first briefly review the backprop algorithm in section B.1. Then, section B.2 establishes formal links between quantities used in backprop and observable features of neuronal responses in a quasi-static framework. Section B.3 extends this framework to a time-dependent rate model. Section B.4 connects the normative approach to a bottom-up derivation of the rate-based learning rule, with the objective to relate the "microscopic" scale (single neurons, synapses, and spikes) to the "macroscopic" scale (neural populations, weights, and rates). Finally, section B.5 details the training procedure of the quasi-static networks, while section B.6 gathers supplementary figures.

B.1 Backprop

In backprop [123], the goal is to minimize a loss (or cost) function $L(y, d)$ that depends on a desired output $d$ and the network's prediction $y$ in response to an input $x$.¹ We consider a network with $L + 1$ layers, $l = 0$ being the input layer and $l = L$ the output layer. In the backprop algorithm with stochastic gradient descent and without mini-batches, each training example is divided into three phases: a feedforward phase, a backward phase and a learning phase. In the feedforward phase, the hidden-layer and output-layer activities are computed sequentially. For each layer $l$, the activity $a_l$ is computed as

$$a_l = f_l(W_l a_{l-1}) \quad (B.1)$$

for $l = 1, \ldots, L$, with $a_0 = x$. The matrix $W_l$ is a weight matrix connecting layer $l - 1$ to layer $l$. The activation function $f_l : \mathbb{R}^{M_l} \to \mathbb{R}^{M_l}$ may depend on the layer, where $M_l$ is the number of units in layer $l$.

¹ All vectors are column vectors and are denoted by lowercase boldface symbols (e.g., y). Matrices are denoted by boldface capital letters (e.g., W). The subscript $l$, as in $W_l$, denotes the quantity for layer $l$. When referring to an element of a vector or a matrix, for convenience the $l$ subscript becomes a superscript, e.g. $W_{ij}^l$ denotes element $ij$ of matrix $W_l$. The superscript $T$ denotes a matrix transpose and $\odot$ is the elementwise (Hadamard) product.


In the backward phase, the errors are sequentially backpropagated from the output layer $l = L$ to the first hidden layer $l = 1$. The error at layer $l$ corresponds to the gradient $g_l = \nabla_{v_l} L$ with respect to $v_l = W_l a_{l-1}$, the weighted sum of inputs to that layer. At the output layer, the error is computed directly from the loss function as $g_L = \nabla_{v_L} L(a_L, d)$. The hidden-layer errors are then calculated using

$$g_l = f_l'(v_l)\odot\nabla_{a_l}L = f_l'(v_l)\odot\left[W_{l+1}^T g_{l+1}\right] \quad (B.2)$$

recursively. In the learning phase, weights are changed in the direction opposite to their gradients:

$$\Delta W_l = -\eta\,\nabla_{W_l}L = -\eta\, g_l a_{l-1}^T, \quad (B.3)$$

with $\eta$ a learning rate. We note that a strict separation between the backward and learning phases is artificial: we could update the weights as soon as the required quantities become known. On the other hand, the sequential nature of the feedforward phase and the backward phase is mandatory. This forces a temporal relationship between quantities that are nevertheless conceived of as static within each example. We will contrast such a quasi-static perspective with explicit time dependence in section B.3.
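As a reference point for the burstprop derivation that follows, here is a minimal sketch of Eqs. B.1-B.3 in Python with NumPy. The 784-100-10 layer sizes, sigmoid activations and squared-error loss are illustrative assumptions, not the architectures used in the main text:

import numpy as np

rng = np.random.default_rng(0)

sizes = [784, 100, 10]
W = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
f = lambda v: 1.0 / (1.0 + np.exp(-v))      # activation f_l
df = lambda v: f(v) * (1.0 - f(v))          # its derivative f_l'
eta = 0.1

def train_step(x, d):
    # Feedforward phase (Eq. B.1): a_l = f_l(W_l a_{l-1}), with a_0 = x.
    a, v = [x], []
    for Wl in W:
        v.append(Wl @ a[-1])
        a.append(f(v[-1]))
    # Backward phase (Eq. B.2): g_L = f'(v_L) * dL/da_L, then recurse down.
    g = df(v[-1]) * (a[-1] - d)             # squared-error loss gradient
    grads = [np.outer(g, a[-2])]
    for l in range(len(W) - 1, 0, -1):
        g = df(v[l - 1]) * (W[l].T @ g)
        grads.insert(0, np.outer(g, a[l - 1]))
    # Learning phase (Eq. B.3): Delta W_l = -eta g_l a_{l-1}^T.
    for Wl, Gl in zip(W, grads):
        Wl -= eta * Gl

train_step(rng.random(784), np.eye(10)[3])  # one example with a one-hot target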

B.2 Quasi-static burstprop

In this section, we explore the logical consequences of relating backprop quantities ($a_l$ and $g_l$) to biophysical changes in bursting. The objective is to build a set of self-consistent equations paralleling Eqs. B.1-B.3 by introducing experimentally derived constraints. As discussed in the main text, a number of studies have attempted to capture the credit assignment properties of backprop with more biologically plausible implementations. To contrast our approach with others and to provide a quick rationale for our philosophy in constructing the burstprop model, Table B.1 compares the burstprop algorithm with other recent approaches.

B.2.1 Derivation

Constraint 1: Feedforward Communication Because the synaptic connections going up the hierarchy have been shown to be short-term depressing [208], these feedforward connections are likely to communicate ensemble event rates [115] (ER; $e_l$). Therefore, we hypothesize that the activities in backprop, $a_l$, can be interpreted as ensemble event rates; the feedforward phase in backprop is thus written as

$$e_l = f_l(v_l), \quad (B.4)$$

where the event rate is a function of feedforward inputs only and $v_l = W_l e_{l-1}$. This interpretation is distinct from the interpretation of activities as firing rates that was assumed in the many studies examining biological alternatives to backprop [239, 41, 189, 114, 101]. Specifically, by allowing events to be either singlets or bursts, this interpretation allows another state variable to be represented and communicated by neuronal ensembles [115].

Constraint 2: Conjunctive Bursting with Two Sites of Integration Next, we need to consider the backpropagation of errors. Since feedback connections strongly target apical dendrites and feedforward connections strongly target basal dendrites [223, 224], we assume two loci of integration: one summing feedforward information and the other summing feedback information. Experimental evidence has shown that the burst rate (BR; $b_l$) results from the conjunction of these two input streams [60, 220, 221].

Table B.1: Comparison of bio-inspired credit-assignment algorithms. We restricted the comparison to works which used feedforward nets or convergent recurrent neural networks and were published after 2015. Algorithms had to have been tested on standard benchmark tasks. We excluded algorithms that de facto used weight transport. We assessed the performance by computing the ratio of the test error for the standard backprop algorithm versus that of the proposed algorithm. When multiple versions of an algorithm were tested, we tried to select the best performing one or the most biologically plausible one. Refs: a[43], b[104], c[103], d[46], e[191], f[269], g[189], h[42], i[101], j[109], k[192], l[110], m[100], n[193], o[114].

[Table B.1 could not be fully recovered from the source layout. Its columns are: Paper; Name of algorithm; Performance (test error ratio BP/model on MNIST, CIFAR10, CIFAR100, ImageNet); Spike-based; Matches in vitro plasticity; Online; Dendritic compartments; STP; Interneuron targeting (soma, dendrites, or none); Notes. Rows are grouped by fixed versus learned feedback weights and cover: Guerguiev et al., 2017; Liao et al., 2016 (Sign-symmetry); Lillicrap et al., 2016 (Feedback Alignment); Mostafa et al., 2018 (Local Error Learning); Nøkland et al., 2016 (Direct Feedback Alignment); Samadi et al., 2017 (Broadcast Alignment); Xiao et al., 2018; Akrout et al., 2019 (Weight Mirror); Amit, 2019 (Updated Random Feedback); Laborieux et al., 2020 (Equilibrium Propagation); Lansdell et al., 2020; Bartunov et al., 2018 (Simplified Difference Target Propagation); Pozzi et al., 2018 (Q-AGREL); Sacramento et al., 2018; and Payeur & Guerguiev et al., 2020 (this paper; Burstprop).]

We assume that the burst probability (BP; $p_l$), defined as BR/ER, is controlled solely by feedback inputs, $u_l$, through a sigmoid

$$p_l = \sigma(\beta u_l + \alpha), \quad (B.5)$$

where $\alpha$ and $\beta$ are scaling and offset parameters reflecting properties of the neuronal ensemble (Fig. B.5). By definition, the burst rate is then the multiplication of a nonlinear readout of feedforward inputs and a nonlinear readout of feedback inputs:

$$b_l = p_l \odot e_l. \quad (B.6)$$

Importantly, we arrive at an expression that is analogous to Eq. B.2, where a nonlinear readout of feedforward information must be multiplied by a linear readout of feedback information to obtain the hidden-layer errors. This analogy suggests that BRs are well-poised to represent hidden-layer errors.

Constraint 3: Signed and Unit-Specific Error Signals The above constraints suggest that BRs encode and communicate hidden-layer errors. However, BRs are strictly positive whereas errors are signed. Therefore we cannot ascribe $g_l$ to $b_l$ directly. Separating the representation of positive and negative errors into different ensembles is not plausible because both signs must be accessible to the unit whose weights are to be steered. One possible solution would be to consider deviations of the BRs with respect to a constant and global reference point, such as assigning $g_l$ to $b_l - b_0$, where $b_0$ is a constant.

We found this approach to be intractable because, in the absence of output errors, $b_l - b_0$ should vanish everywhere, yet feedback connections still communicate changing and unit-specific signals. These considerations suggest a more careful examination of the network state without output error. For BRs to act as an error signal to drive plasticity, the particular BRs reached by the network in the absence of output error should produce no net plasticity. These "reference BRs" are determined by backpropagation from the top layer. In the absence of any output error, all output-layer BPs should reach a constant value. Let us denote these possibly unit-specific reference BPs as $\bar{p}_L$. Then, the reference output BPs and the output ERs combine to produce the reference BRs: $\bar{b}_L = \bar{p}_L \odot e_L$. In the layers below, the reference BRs are integrated by the apical dendrites to yield the reference dendritic potentials $\bar{u}_{L-1}$, and then $\bar{p}_{L-1} = \sigma(\beta\bar{u}_{L-1} + \alpha)$, and so on for all other layers. In this way, we find that reference BRs are unit-specific and depend on the stimulation pattern through ERs as well as on the state of the network through the weight matrices.

We now consider how changes upon a perturbation of the output-layer BPs $p_L$ could be used as signed signals to steer plasticity. The reference BRs and BPs are computed in the absence of a teacher as per the previous paragraph. When a teacher signal is introduced, its effect backpropagates through the network to generate BPs and BRs. Comparing bursting with and without the teacher, $\delta p_l = p_l - \bar{p}_l$ forms a signed signal, which could be ascribed to backprop hidden-layer errors.

Constraint 4: Hebbian Learning Rule The next step is to see how these constraints and assumptions yield a burst-dependent plasticity rule approximating gradient descent. The backprop learning rule (Eq. B.3) with the interpretation $a_{l-1} = e_{l-1}$ and $g_l = \gamma\,\delta b_l$ suggests that we can obtain a Hebbian learning rule if we choose $\gamma = -1$. Thus we postulate that the hidden-layer errors are represented in variations of the burst rate

$$g_l = -\delta b_l, \quad (B.7)$$

such that the backprop learning rule becomes

$$\Delta W_l = \eta\,\delta b_l\, e_{l-1}^T = \eta\,(\delta p_l \odot e_l)\, e_{l-1}^T. \quad (B.8)$$

This rule is Hebbian since potentiation results from a nonzero presynaptic ER and a positive postsynaptic BR deviation. It is a three-factor learning rule since the Hebbian factor $e_l e_{l-1}^T$ is controlled by the signed factor $\delta p_l$. At this point, the quantities $b$ and $\bar{b}$ correspond to two distinct network equilibria, similar to the theory of contrastive Hebbian learning [108, 41]. We will address this artificial construct in sections B.3 and B.4, where we show that the dynamics of the burst-dependent learning rule are such that an estimate of $\delta b$ is available at synapses and drives plasticity of the form of Eq. B.8, provided that we introduce a plasticity gate (factor M).

Constraint 5: Feedback Communication We now need expressions for $u_l$, $\bar{u}_l$ and $\delta u_l \equiv u_l - \bar{u}_l$ to obtain the burstprop equivalent of the backpropagation of error. To derive these expressions, we substitute $\delta b_l$ for $g_l$ and $e_l$ for $a_l$ in Eq. B.2 to obtain

$$\delta p_l \odot e_l = f_l'(v_l)\odot\left[W_{l+1}^T\,\delta b_{l+1}\right].$$

Since $\delta p_l$ is a function of $u_l$ and $\bar{u}_l$, this expression provides an implicit definition of $\delta u_l$ in terms of the state of layers $l$ and $l + 1$.

To establish more explicit constraints, we consider the case where both $u_l$ and $\bar{u}_l$ are within the linear regime of $\sigma(\cdot)$. If $e_l \neq 0$ and if we assume for simplicity that $\sigma'(\alpha)\beta = 1$, the equation above gives an expression for $\delta u_l$ in terms of the activation function

$$\delta u_l = f_l'(v_l)\odot e_l^{-1}\odot\left[W_{l+1}^T\,\delta b_{l+1}\right],$$

where $e_l^{-1}$ denotes the elementwise inverse.

For an exponential activation function (see Table B.2 for the general formulation), $f_l'(v_l)\odot e_l^{-1} = 1$ and this expression implies that the feedback connections should backpropagate the BRs from the layer above:

$$u_l = Y_l\, b_{l+1}, \quad (B.9)$$

and similarly for $\bar{u}_l$, with $b_{l+1}$ replaced by $\bar{b}_{l+1}$. Here, the feedback matrix $Y_l$ replaces the matrix $W_{l+1}^T$ that appears in standard backprop. As described in the main text, $Y_l$ is initially a random matrix, as in the feedback alignment algorithm [46], but can be learned using the Kolen-Pollack algorithm [42] (see Methods).

Encoding Error at the Output Layer To complete the picture, we need to define the error at the output layer. For this, the output-layer BPs are set to a nonlinear function of the gradient of the loss with respect to the output ERs

$$p_L = \zeta\left(\bar{p}_L - f_L'(v_L)\odot e_L^{-1}\odot\nabla_{e_L}L\right),$$

where $\zeta$ is any squashing function making sure that $p_L \in [0, 1]^{M_L}$. For simplicity, we shall use $\bar{p}_L = p_L^{(0)}(1, 1, \ldots, 1)^T$ as the reference output BP. The squashing function does not have to be a sigmoidal function since its argument does not represent a dendritic potential per se.

Table B.2: Summary of the main equations of burstprop and backprop. We have defined $h(e_l) \equiv f_l'(v_l)\odot e_l^{-1}$. For a sigmoid: $h(e_l) = 1 - e_l$; for an exponential: $h(e_l) = 1$.

Burstprop:
  $e_0 = x$
  $e_l = f_l(W_l e_{l-1})$
  $\bar{p}_L = p_L^{(0)}(1, 1, \ldots, 1)^T$
  $p_L = \zeta(\bar{p}_L - h(e_L)\odot\nabla_{e_L}L)$
  $\bar{u}_l = h(e_l)\odot(Y_l \bar{b}_{l+1})$, $\bar{p}_l = \sigma(\beta\bar{u}_l + \alpha)$, $\bar{b}_l = \bar{p}_l\odot e_l$
  $u_l = h(e_l)\odot(Y_l b_{l+1})$, $p_l = \sigma(\beta u_l + \alpha)$, $b_l = p_l\odot e_l$
  $\Delta W_l = \eta\,\delta b_l\, e_{l-1}^T$

Backprop:
  $a_0 = x$
  $a_l = f_l(W_l a_{l-1})$
  $g_L = f_L'(v_L)\odot\nabla_{a_L}L$
  $g_l = f_l'(v_l)\odot[W_{l+1}^T g_{l+1}]$
  $\Delta W_l = -\eta\, g_l a_{l-1}^T$
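To make the quasi-static recipe concrete, the following is a minimal single-example sketch of the Table B.2 equations in NumPy for one hidden layer, with sigmoid units (so h(e) = 1 - e), alpha = 0, beta = 1, and the output burst probabilities described later in section B.5.2. All sizes and scales are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
sig = lambda v: 1.0 / (1.0 + np.exp(-v))

n_in, n_hid, n_out = 20, 10, 5
W1 = rng.normal(scale=0.5, size=(n_hid, n_in))
W2 = rng.normal(scale=0.5, size=(n_out, n_hid))
Y1 = rng.normal(scale=0.5, size=(n_hid, n_out))   # fixed random feedback
p0, eta = 0.2, 0.1                                # baseline BP, learning rate

x = rng.random(n_in)
d = np.eye(n_out)[2]                              # one-hot teacher

# Feedforward phase: event rates.
e1 = sig(W1 @ x)
e2 = sig(W2 @ e1)

# Reference (teacher absent) and teacher-driven output burst probabilities.
p2_bar = p0 * np.ones(n_out)
p2 = p0 * ((d - e2) * (1.0 - e2) + 1.0)

# Hidden-layer burst probabilities from feedback of output burst rates.
u1_bar = (1.0 - e1) * (Y1 @ (p2_bar * e2))
u1 = (1.0 - e1) * (Y1 @ (p2 * e2))
p1_bar, p1 = sig(u1_bar), sig(u1)

# Burst-dependent updates: Delta W_l = eta (delta p_l * e_l) e_{l-1}^T.
W2 += eta * np.outer((p2 - p2_bar) * e2, e1)
W1 += eta * np.outer((p1 - p1_bar) * e1, x)

Note that, as in Eq. B.8, only event rates and burst-probability deviations enter each update, so every factor is locally available at the synapse.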

B.3 Time-dependent rate model

In the last section, we derived a burst-dependent model of credit assignment by identifying a number of key biophysical features of neural coding and information propagation with variables from the backprop algorithm. We called this framework quasi-static because time appeared only through ad hoc temporal relationships between the quantities involved, reminiscent of the sequential steps that need to be performed in numerical implementations of these models. For instance, the reference BPs, $\bar{p}_l$, and the perturbed BPs, $p_l$, were established at two different "times". However, no mechanisms were suggested to explain how $\bar{p}_l$ could serendipitously continue to exist until the output error signal is applied, so that it can be compared with $p_l$ (other than artificially in computer memory). In the present section, we address such aspects of the quasi-static model by describing a continuous-time implementation of burstprop. This exposition will also help establish the link between the spike-based and the rate-based burstprop learning rules in section B.4.

B.3.1 Dynamics

Feedforward propagation Each example is presented for a total duration $T$, during which the ERs evolve according to

$$\tau_v \frac{dv_l}{dt} = -v_l(t) + W_l\, e_{l-1}(t), \quad (B.10)$$

with $e_l(t) = f_l[v_l(t)]$ and $e_0(t) \equiv x$. Section B.4 provides a heuristic derivation of Eq. B.10. We neglected possible propagation delays from layer to layer. With constant weights, the ERs approach the steady states $e_l^* = f_l(W_l e_{l-1}^*)$ with time constant $\tau_v$.
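A minimal sketch of these dynamics (our illustration; sizes, input and the sigmoid activation are assumed):

import numpy as np

rng = np.random.default_rng(0)
f = lambda v: 1.0 / (1.0 + np.exp(-v))

tau_v, dt, T = 10.0, 0.1, 100.0     # ms
W = rng.normal(size=(5, 8))
e_in = rng.random(8)                # e_{l-1}(t), held constant here
v = np.zeros(5)

for _ in range(int(T / dt)):
    v += (dt / tau_v) * (-v + W @ e_in)   # Euler step of Eq. B.10
e = f(v)
print(np.allclose(e, f(W @ e_in), atol=1e-4))   # True: steady state reached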

Backpropagation During the first part of each example, when the error signal is absent, $p_L$ can be set to a constant vector $p_L^{(0)}(1, 1, \ldots, 1)^T$. This part is called the prediction interval and lasts $T_{pred}$. In the remainder of the example (the teaching interval of duration $T_{teach} = T - T_{pred}$), the error is encoded into $p_L$, similarly to the quasi-static model. At all times, the hidden-layer BPs obey $p_l(t) = \sigma[\beta u_l(t) + \alpha]$ with

$$\tau_u \frac{du_l}{dt} = -u_l + f_l'(v_l)\odot e_l^{-1}\odot Y_l\, b_{l+1}, \quad (B.11)$$

where $\tau_u$ is a time constant for dendritic integration.

To compute $\delta p_l = p_l - \bar{p}_l$, we must achieve a local representation of two quantities evolving conjointly in time. We consider $\bar{p}_l$ to be a moving average of $p_l$:

$$\tau_{avg}\frac{d\bar{e}_l}{dt} = e_l - \bar{e}_l$$

$$\tau_{avg}\frac{d\bar{b}_l}{dt} = b_l - \bar{b}_l$$

$$\bar{p}_l(t) = \bar{b}_l(t)\,/\,\bar{e}_l(t) \quad \text{(elementwise)},$$

where $\bar{e}_l$ (resp. $\bar{b}_l$) is an exponential moving average of the event (resp. burst) rate.²

Controlling plasticity

If the differential weight changes $dW_l(t)$ are proportional to $\delta p_l(t) = p_l(t) - \bar{p}_l(t)$, then whenever $\delta p_l(t) \neq 0$ a weight update occurs (provided that the pre and post ERs are nonzero). This can lead to unsupervised plasticity and inadequate learning when fluctuations in $p_l(t)$ are due to changes in ERs rather than being directly caused by the error encoded in $p_L(t)$. We delineate two ways in which unsupervised plasticity will take place. First, according to Eq. B.11, unsupervised plasticity happens whenever the ERs respond to stimulus onset or offset. Therefore, weights should not be updated during the stimulus-evoked transients, but rather during the sustained firing period in between. To ensure weight updates follow supervised errors, we use the term $M(t) \in [0, 1]$ to control learning, setting $M$ close to 1 during the teaching intervals and close to 0 otherwise. Second, ER changes due to learning during the teaching interval can introduce a similar effect. However, a small learning rate, short teaching intervals, and using $\bar{e}_L(t)$ instead of $e_L(t)$ to compute the output errors can mitigate these undesired sources of plasticity. In summary, the forward weights are updated according to

$$\frac{dW_l}{dt} = \eta\, M(t)\,\left[(p_l(t) - \bar{p}_l(t))\odot e_l(t)\right] e_{l-1}^T(t).$$

Figure B.4 illustrates the learning process. As a validation of the model, the next subsection shows that the quasi-static model is a limiting case of the time-dependent model.

² With these definitions, $\tau_{avg}\frac{d\bar{p}_l}{dt} = (p_l - \bar{p}_l)\odot\frac{e_l}{\bar{e}_l}$, and the estimated burst probability follows the actual burst probability with an effective time constant $\tau_{avg}\,\bar{e}_l/e_l$. However, performing the average over $e_l$ and $b_l$ and then computing $\bar{p}_l$ is closer to what has been done in the spiking simulations (see Methods in main text).
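As a short check (a derivation we add here, using only the moving-average dynamics above, the definition $\bar{p}_l = \bar{b}_l/\bar{e}_l$ and $b_l = p_l \odot e_l$), the footnote's relation follows from the quotient rule, applied elementwise:

$$\tau_{avg}\frac{d\bar{p}_l}{dt} = \frac{\tau_{avg}\dot{\bar{b}}_l\,\bar{e}_l - \bar{b}_l\,\tau_{avg}\dot{\bar{e}}_l}{\bar{e}_l^2} = \frac{(b_l - \bar{b}_l)\,\bar{e}_l - \bar{b}_l\,(e_l - \bar{e}_l)}{\bar{e}_l^2} = \frac{b_l - \bar{p}_l\, e_l}{\bar{e}_l} = (p_l - \bar{p}_l)\,\frac{e_l}{\bar{e}_l}.$$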

B.3.2 Limiting case

To ensure that the time-dependent model is consistent with the quasi-static model, we study below a limiting case of the discretized time-dependent model in which the time bins match the integration time constant. Here, activation functions are exponential, for simplicity.

Feedforward propagation The feedforward equations are discretized on time bins of duration $dt$:

$$v_l[t] = \left(1 - \frac{dt}{\tau_v}\right)v_l[t-1] + \frac{dt}{\tau_v}W_l[t-1]\,e_{l-1}[t-1],$$

with $e_l[t] = f_l(v_l[t])$. We shall set $dt = \tau_v$ in the limiting case, so that

$$v_l[t] = W_l[t-1]\,e_{l-1}[t-1].$$

This corresponds to the quasi-static feedforward phase once we associate $t$ with $l$: at time $t = 0$ of processing example $m$, the input-layer activity is set to $x^{(m)}$ and, at time $t = L$, the prediction for example $m$, $e_L[L]$, is completed. The layer-$l$ ERs reach their steady state at time $t = l$.

Backpropagation The discretized event and burst moving averages obey

$$\bar{e}_l[t] = \bar{e}_l[t-1] + \frac{dt}{\tau_{avg}}\left(e_l[t-1] - \bar{e}_l[t-1]\right),$$

and similarly for $\bar{b}$. In the limiting case, we set $\tau_{avg} = dt$, so that $\bar{e}_l[t] = e_l[t-1]$ and $\bar{b}_l[t] = b_l[t-1]$. As a consequence,

$$\bar{p}_l[t] = p_l[t-1],$$

where $p_l[t] = \sigma(\beta u_l[t] + \alpha)$ and $u_l[t] = Y_l\, b_{l+1}[t-1]$ after discretizing Eq. B.11 with $\tau_u = \tau_v$. We note that, at time $L + 1$,

$$\bar{b}_L[L+1] = b_L[L] = p_L^{(0)}\, e_L[L],$$

which corresponds to the reference output BR of the quasi-static model. Moreover, at time $t = L + 2$, at layer $L - 1$,

$$\bar{p}_{L-1}[L+2] = p_{L-1}[L+1] = \sigma(\beta Y_{L-1}\bar{b}_L[L+1] + \alpha),$$

and the limiting case has thus recovered the reference dendritic potential of the quasi-static model, $Y_{L-1}\bar{b}_L[L+1]$. Similar expressions can be derived for the other layers. At time $L + 1$, $p_L[L+1]$ can be set to

$$p_L[L+1] = \zeta\left(\bar{p}_L[L+1] - \nabla_{\bar{e}_L[L+1]}L\right),$$

where $\bar{e}_L[L+1] = e_L[L]$ and $\bar{p}_L[L+1] = p_L^{(0)}(1, 1, \ldots, 1)^T$ corresponds to the reference output BP of the quasi-static model. Therefore, error backpropagation in the limiting case corresponds to error backpropagation in the quasi-static model.

Learning Finally, the discrete-time weight updates are given by

$$W_l[t+1] = W_l[t] + dt\,\eta\, M[t]\,\left[(p_l[t] - \bar{p}_l[t])\odot e_l[t]\right] e_{l-1}^T[t].$$

We can identify dtη above with the learning rate for the quasi-static model. The term M[t] turns off plasticity when switching examples, something that is done implicitly in the quasi-static model. In sum, this limiting case of the time-dependent rate model agrees with the quasi-static model.

B.4 Linking the rate-based and spike-based models

All the models above are coarse-grained models wherein the variables are ensemble averages over a local population. For instance, $e_n^l$ is the ER of population $n$ in layer $l$. In this section, we relate these "macroscopic" variables at the population level to the "microscopic" variables at the level of neurons and synapses. Such a link pertains to the mean-field theory of spiking neural networks [233, 232, 231]. Here, we provide a heuristic derivation for current-based synapses, generalized linear model neurons and all-to-all connections.

Single-neuron dynamics Let $V_{n,i}^l(t)$ be the somatic membrane potential relative to rest of pyramidal neuron $i$ in population $n$ of layer $l$ at time $t$. If $M_{l-1}$ is the number of pyramidal neuron ensembles in layer $l - 1$ and $N$ is the number of neurons per population, then a generalized linear model for $V_{n,i}^l$ can be written

$$V_{n,i}^l(t) = \sum_{m=1}^{M_{l-1}}\sum_{j=1}^{N} w_{ij}^{nm}\,(\mathcal{E} * E_{m,j}^{l-1})(t) + \sum_{k=1}^{N} q_{ik}^{n}\,(\mathcal{I} * S_{n,k}^{l})(t), \quad (B.12)$$

where we have neglected refractory (post-spike) effects, a reasonable assumption for low spiking activity. The first term represents the excitatory effect of presynaptic events on the postsynaptic potential, whereas the second term represents the response to inhibition coming from local interneurons. The neuron processes incoming event ($E_{m,j}^{l-1}$) or spike ($S_{n,k}^{l}$) trains with filters $\mathcal{E}$ and $\mathcal{I}$, respectively. The amplitudes of the voltage responses are given by the inhibitory ($q_{ik}^{n} < 0$) and excitatory ($w_{ij}^{nm} > 0$) synaptic weights. More precisely, $w_{ij}^{nm}$ connects presynaptic pyramidal neuron $j$ of population $m$ in layer $l - 1$ to postsynaptic pyramidal neuron $i$ of population $n$ in layer $l$. Events are produced using an inhomogeneous Poisson process with firing intensity function $f_E(V_{n,i}^l(t))$ [115]. The coupling between the somatic and dendritic compartments has been omitted for simplicity. An equation similar to Eq. B.12 could be written for the dendritic compartments, and then we could use the sigmoidal transfer function to get the burst probabilities. In what follows, we shall focus on the feedforward propagation of events, keeping in mind that similar steps can be followed for the feedback pathway.
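The following is a minimal single-neuron sketch of Eq. B.12 in discrete time (an illustration we add; the kernel shapes, rates, weights and the exponential intensity $f_E$ are all assumed):

import numpy as np

rng = np.random.default_rng(0)

dt, T = 0.5, 1000.0                 # ms
steps = int(T / dt)
n_exc, n_inh = 50, 10
w = np.full(n_exc, 0.2)             # excitatory weights w_ij > 0
q = np.full(n_inh, -0.5)            # inhibitory weights q_ik < 0
tau_E, tau_I = 10.0, 5.0            # assumed exponential filter time constants
f_E = lambda V: 0.01 * np.exp(V)    # events/ms, exponential intensity

x_E = np.zeros(n_exc)               # filtered presynaptic event trains
x_I = np.zeros(n_inh)               # filtered inhibitory spike trains
n_events = 0
for k in range(steps):
    E_pre = rng.random(n_exc) < 0.005 * dt   # presynaptic events, 5 Hz
    S_inh = rng.random(n_inh) < 0.010 * dt   # inhibitory spikes, 10 Hz
    x_E += dt * (-x_E / tau_E) + E_pre
    x_I += dt * (-x_I / tau_I) + S_inh
    V = w @ x_E + q @ x_I                    # Eq. B.12 for one neuron
    if rng.random() < f_E(V) * dt:           # inhomogeneous Poisson events
        n_events += 1

print(f"event rate: {n_events / T * 1000:.1f} Hz")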

Coarse-graining In the spiking simulations reported in the main text, the synaptic weights were sparsely distributed and the nonzero weights were all initialized to the same value. Of course, for plastic weights, the nonzero weights changed with learning and distributed themselves over time. However, to drastically simplify the following derivation, we shall assume that both the excitatory and inhibitory synaptic weights are fixed, with zero variance, and that the connectivity between two populations is all-to-all (no vanishing weights). Our goal here is to convey in a qualitative fashion how the rate-based model relates to a simplified spike-based framework. We define

$$w_{ij}^{nm} = J_{nm}^l/N \quad\text{and}\quad q_{ik}^{n} = Q_n^l/N,$$

where $J_{nm}^l > 0$ and $Q_n^l < 0$. The membrane potentials of population-$n$ neurons then become independent of the specific neuron $i$. Defining the macroscopic ER of population $m$ in layer $l - 1$ by

$$e_m^{l-1}(t) = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} E_{m,i}^{l-1}(t)$$

and the population activity of the inhibitory neurons by

$$i_n^{l}(t) = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} S_{n,i}^{l}(t),$$

in the limit $N \to \infty$ Eq. B.12 becomes

$$v_n^l(t) = \sum_{m=1}^{M_{l-1}} J_{nm}^l\,(\mathcal{E} * e_m^{l-1})(t) + Q_n^l\,(\mathcal{I} * i_n^l)(t).$$

Here, we substituted $v_n^l(t)$ for $V_n^l(t)$ because this equation only involves macroscopic quantities. In the simple case of a linear-nonlinear model, we can link this potential to the population ER by using $e_n^l(t) = f_E(v_n^l(t))$ [270, 271]. The inhibitory interneurons' population activity can itself be written as

$$i_n^l(t) = f_I\left(\sum_{m=1}^{M_{l-1}} J_{nm}^{(ie)}\,(\mathcal{E} * e_m^{l-1})(t)\right),$$

where $J_{nm}^{(ie)} > 0$, and we assumed that the only synaptic connections onto local inhibitory neurons come from layer-$(l-1)$ pyramidal neurons with the same kernel $\mathcal{E}$ as above. If the link function $f_I$ is approximately linear then

$$e_n^l(t) = f_E\left(\sum_{m=1}^{M_{l-1}} \left[J_{nm}^l\,\mathcal{E} + Q_n^l J_{nm}^{(ie)}\,(\mathcal{E} * \mathcal{I})\right] * e_m^{l-1}(t)\right).$$

For the quasi-static model, we can thus loosely identify the macroscopic weights $W_{nm}^l$ with

$$W_{nm}^l \sim A\,J_{nm}^l + B\,Q_n^l J_{nm}^{(ie)}, \quad (B.13)$$

where $A$ and $B$ are positive constants corresponding to the integrated synaptic kernels. Since $Q_n^l < 0$, $W_{nm}^l$ can be either positive or negative. Therefore, the weights appearing in the macroscopic learning rule (Eq. B.8) should be interpreted in the sense of Eq. B.13. We assumed that only the pyramid-to-pyramid synapses are plastic, i.e. $W_{nm}^l$ is composed of plastic excitation over a pool of nonplastic inhibition. For the time-dependent rate model, we can now justify Eq. B.10 if we assume that the effective kernel $\kappa_{eff}$, defined by identifying

$$W_{nm}^l\,\kappa_{eff} \sim J_{nm}^l\,\mathcal{E} + Q_n^l J_{nm}^{(ie)}\,(\mathcal{E} * \mathcal{I}), \quad (B.14)$$

can be approximated by an exponential filter.

B.4.1 Linking the learning rules

Equipped with the heuristic results of the last section, we can now relate the spike-based and rate-based learning rules. The objective here is to recover

$$\frac{dW_{nm}^l}{dt} = \eta\,(p_n^l - \bar{p}_n^l)\,e_n^l\,e_m^{l-1}$$

by ensemble-averaging the spike-based learning rule. For convenience, we recall that the spike-based rule reads (Eq. 4.2 in main text)

$$\frac{dw_{ij}}{dt} = \eta\,(B_i - \bar{P}_i E_i)\,\widetilde{E}_j, \quad (B.15)$$

where it is implicit that presynaptic neuron $j$ belongs to population $m$ in layer $l - 1$ and postsynaptic neuron $i$ to population $n$ in layer $l$, and we omitted the time arguments for clarity. Also, since $w_{ij}$ scales as $O(1/N)$, we make the spike-based learning rate $\eta$ scale as $O(1/N)$ as well. Replacing $w_{ij} = J_{nm}^l/N$ and taking the ensemble average of the right-hand side yields

$$\frac{dJ_{nm}^l}{dt} \approx \eta'\left(\langle B_i\rangle - \langle\bar{P}_i E_i\rangle\right)\langle\widetilde{E}_j\rangle, \quad (B.16)$$

where $\eta' = \lim_{N\to\infty} N\eta$, which is well-defined according to the aforementioned scaling of $\eta$. We assumed that a given pre-post neuron pair is weakly correlated, i.e., the probability that postsynaptic neuron $i$ fires when its presynaptic partner $j$ produces an event is small on average. Since

$$\widetilde{E}_j(t) = (\kappa * E_j)(t),$$

where $\kappa(t) = e^{-t/\tau_{pre}}\,\Theta(t)$ and $\tau_{pre} \sim 10$ ms, then $\langle\widetilde{E}_j(t)\rangle = (\kappa * e_m^{l-1})(t)$. If $e_m^{l-1}$ varies slowly with respect to $\kappa$, then $\langle\widetilde{E}_j(t)\rangle \approx \tau_{pre}\,e_m^{l-1}(t)$ and we recover the presynaptic term of the rate-based learning rule. On the postsynaptic side, if, as mentioned in the main text, the probability of converting an event into a burst is weakly correlated with the somatic processes that led to the event in the first place, then we can write

$$\langle B_i\rangle - \langle\bar{P}_i\rangle\langle E_i\rangle = b_n^l - \bar{p}_n^l\, e_n^l,$$

where $b_n^l = p_n^l e_n^l$. Thus, we now have

$$\frac{dJ_{nm}^l}{dt} \approx \eta''\,(p_n^l - \bar{p}_n^l)\,e_n^l\, e_m^{l-1},$$

where $\eta'' = \tau_{pre}\,\eta'$. In Eq. B.14, if $\mathcal{I}(t) = \delta(t)$ and $\mathcal{E}$ is an exponential filter, then $W_{nm}^l$ can be directly identified with $J_{nm}^l + Q_n^l J_{nm}^{(ie)}$. In this case, $dJ_{nm}^l/dt = dW_{nm}^l/dt$ because $Q_n^l J_{nm}^{(ie)}$ is constant.

Assuming that $\mathcal{I}(t) = \delta(t)$ is justified if the time scale of inhibitory postsynaptic potentials is much smaller than the time scale of excitatory postsynaptic potentials evoked by presynaptic events. This assumption is supported by the fact that events are detected by the comparatively slow process of short-term depression. Under all these assumptions, we thus recover the rate-based learning rule with the correct ER dynamics. We note that burst size is known to control the amplitude of plasticity [228] and could be introduced in our spike timing learning rule to obtain $M(t)$ in the rate-based learning rule.
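As a sanity check on this ensemble-averaging argument, here is a minimal Monte Carlo sketch (our illustration; the rates, bin size and burst probabilities are assumed, eta is set to 1, and independent pre- and postsynaptic trains stand in for the weak-correlation assumption):

import numpy as np

rng = np.random.default_rng(0)

dt, T = 1.0, 200000.0          # ms
steps = int(T / dt)
tau_pre = 10.0                 # ms, time constant of the presynaptic trace
e_pre, e_post = 0.005, 0.004   # events/ms
p, p_bar = 0.35, 0.20          # actual and estimated burst probabilities

E_pre = (rng.random(steps) < e_pre * dt).astype(float)
E_post = (rng.random(steps) < e_post * dt).astype(float)
B_post = E_post * (rng.random(steps) < p)      # events become bursts w.p. p

trace, dw = 0.0, 0.0
for k in range(steps):
    trace += dt * (-trace / tau_pre) + E_pre[k]    # filtered events E~_j
    dw += (B_post[k] - p_bar * E_post[k]) * trace  # Eq. B.15 with eta = 1

measured = dw / T
predicted = tau_pre * (p - p_bar) * e_post * e_pre   # eta'' = tau_pre here
print(f"measured  {measured:.2e}")
print(f"predicted {predicted:.2e}")   # the two agree up to sampling noise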

B.5 Models trained on MNIST, CIFAR-10 and ImageNet

B.5.1 Model architectures

In Fig. 4.5 of the main text, a 784-500-500-500-10 fully-connected network with sigmoid units was trained on MNIST. Table B.3 shows the network architectures used to train on MNIST, CIFAR-10 and ImageNet in Fig. 4.6 of the main text and in supplementary Fig. B.8. The exception is the networks trained on CIFAR-10 and ImageNet with fixed feedback weights: their architectures were chosen to have the same number of learnable parameters as those with learned feedback weights, and are shown in Table B.4.

Layer | MNIST | CIFAR-10 | ImageNet
Input | 28 × 28 × 1 | 32 × 32 × 3 | 224 × 224 × 3
1 | Conv 4 × 4, 8, Stride 2, Sigmoid | Conv 5 × 5, 64, Stride 2, Sigmoid | Conv 9 × 9, 48, Stride 4, ReLU
2 | Conv 3 × 3, 16, Stride 2, Sigmoid | Conv 5 × 5, 128, Stride 2, Sigmoid | Conv 3 × 3, 48, Stride 2, ReLU
3 | FC 500, Sigmoid, Recurrent | Conv 3 × 3, 256, Sigmoid | Conv 5 × 5, 96, ReLU
4 | FC 500, Sigmoid, Recurrent | FC 1024, Sigmoid, Recurrent | Conv 3 × 3, 96, Stride 2, ReLU
5 | FC 10, Sigmoid | FC 10, Sigmoid | Conv 3 × 3, 192, ReLU
6 | – | – | Conv 3 × 3, 192, Stride 2, ReLU
7 | – | – | Conv 3 × 3, 384, ReLU
8 | – | – | FC 1000, Softmax
Trainable Params | 1,588,432 | 4,432,576 | 3,539,856 (Burstprop)

Table B.3: Network architectures used to train on MNIST, as well as for the CIFAR-10 and ImageNet experiments using backprop, node perturbation and burstprop with learned feedback weights, in Figs. 4.6 and B.8.
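For concreteness, a minimal PyTorch sketch of the fully-connected sigmoid network of Fig. 4.5 (layer sizes from the text above; the module layout is our own and may differ from the thesis code):

```python
import torch.nn as nn

# Sketch of the 784-500-500-500-10 fully-connected sigmoid network of Fig. 4.5.
mnist_mlp = nn.Sequential(
    nn.Flatten(),                         # 28 x 28 x 1 -> 784
    nn.Linear(784, 500), nn.Sigmoid(),
    nn.Linear(500, 500), nn.Sigmoid(),
    nn.Linear(500, 500), nn.Sigmoid(),
    nn.Linear(500, 10),  nn.Sigmoid(),    # sigmoid output, trained with MSE
)
```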

Layer | CIFAR-10 | ImageNet
Input | 32 × 32 × 3 | 224 × 224 × 3
1 | Conv 5 × 5, 64, Stride 2, Sigmoid | Conv 9 × 9, 48, Stride 4, ReLU
2 | Conv 5 × 5, 256, Stride 2, Sigmoid | Conv 3 × 3, 96, Stride 2, ReLU
3 | Conv 3 × 3, 256, Sigmoid | Conv 5 × 5, 96, ReLU
4 | FC 1480, Sigmoid | Conv 3 × 3, 192, Stride 2, ReLU
5 | FC 10, Sigmoid | Conv 3 × 3, 192, ReLU
6 | – | Conv 3 × 3, 384, Stride 2, ReLU
7 | – | Conv 3 × 3, 470, ReLU
8 | – | FC 1000, Softmax
Trainable Params | 4,428,944 | 3,539,072

Table B.4: Network architectures used to train on CIFAR-10 and ImageNet with fixed feedback weights in Fig. 4.6.

B.5.2 Activation functions, burst probabilities and weight update rules

In the MNIST and CIFAR-10 networks, the sigmoid activation function is used at each layer to compute the ERs. In order to incorporate information about the derivative of the ERs' activation function, the burst probabilities of units in the output layer $L$ are given by:

$$p_L = p_L^{(0)}\,(1, 1, \ldots, 1)^T$$
$$\bar{p}_L = p_L^{(0)}\left[ (d - e_L) \odot (1 - e_L) + 1 \right]$$

where $p_L^{(0)}$ is a constant baseline BP (set to 0.2 in these experiments) and $d$ is the target signal. For hidden-layer units at layer $l$, the BPs are given by ($\alpha = 0$, $\beta = 1$)

$$p_l = \sigma(u_l)$$
$$\bar{p}_l = \sigma(\bar{u}_l)$$

where $u_l$ and $\bar{u}_l$ are given by

$$u_l = (1 - e_l) \odot (Y_l b_{l+1})$$
$$\bar{u}_l = (1 - e_l) \odot (Y_l \bar{b}_{l+1}).$$

We note that for convolutional layers, $Y_l$ is convolved with $b_{l+1}$ and $\bar{b}_{l+1}$.
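The following sketch shows how these quantities could be computed in the fully-connected case (our own Python/PyTorch illustration; the function names and tensor shapes are assumptions):

```python
import torch

def output_bps(e_L, d, p0=0.2):
    """Output-layer BPs for the sigmoid/MSE case; the (1 - e_L) factor carries
    the sigmoid derivative and p0 is the baseline burst probability."""
    p_L = p0 * torch.ones_like(e_L)
    pbar_L = p0 * ((d - e_L) * (1 - e_L) + 1)
    return p_L, pbar_L

def hidden_bps(e_l, Y_l, b_next, bbar_next):
    """Hidden-layer BPs; activities are (batch, units) and Y_l is
    (units_l, units_{l+1}). Convolutional layers would instead convolve
    Y_l with the feedback burst rates."""
    u = (1 - e_l) * (b_next @ Y_l.T)
    ubar = (1 - e_l) * (bbar_next @ Y_l.T)
    return torch.sigmoid(u), torch.sigmoid(ubar)
```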

In the ImageNet network, the ReLU activation function is used in hidden layers, while the output-layer units have a softmax activation function. Here, the output BPs are given by:

$$p_L = p_L^{(0)}\,(1, 1, \ldots, 1)^T$$
$$\bar{p}_L = \min\left\{ 1,\; p_L^{(0)}\left[ \kappa\,(d - e_L)/(e_L + \epsilon) + 1 \right] \right\}$$

where $\epsilon$ (set to $10^{-8}$) prevents division by zero, and $\kappa$ (set to $10^{-5}$) is chosen such that $\bar{p}_L$ spans the range $(0, 1)$. This burst probability formulation ensures that $(\bar{p}_L - p_L) \odot e_L \approx p_L^{(0)}\kappa\,(d - e_L)$, and therefore reflects the gradient of the cross-entropy loss function, as long as $e_L$ is not very small (i.e. $e_L > \kappa$). For units in the final hidden layer $L-1$, the BPs are obtained as above, but now with

$$u_{L-1} = (Y_{L-1} b_L) \odot \frac{\Theta(e_{L-1})}{e_{L-1} + \epsilon}$$

$$\bar{u}_{L-1} = \left(Y_{L-1}\left[\kappa^{-1}(\bar{b}_L - b_L) + b_L\right]\right) \odot \frac{\Theta(e_{L-1})}{e_{L-1} + \epsilon}$$

where $\Theta$ is the Heaviside step function. This formulation rescales the feedback from the output layer to account for $\kappa$, which otherwise would make the difference in burst probabilities at hidden layer $l$, $\bar{p}_l - p_l$, become vanishingly small in earlier layers of the network. For units at all other hidden layers $l$, $u_l$ is given by:

$$u_l = (Y_l b_{l+1}) \odot \frac{\Theta(e_l)}{e_l + \epsilon}.$$
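A sketch of the ImageNet formulas for the fully-connected case (our own code; function names and shapes are assumptions, and carrying $p_L^{(0)} = 0.2$ over from the MSE case is an assumption as well):

```python
import torch

def output_bps_softmax(e_L, d, p0=0.2, kappa=1e-5, eps=1e-8):
    """Output BPs for the softmax/cross-entropy case; the min{1, .} clipping
    keeps pbar_L a valid probability."""
    p_L = p0 * torch.ones_like(e_L)
    pbar_L = torch.clamp(p0 * (kappa * (d - e_L) / (e_L + eps) + 1), max=1.0)
    return p_L, pbar_L

def final_hidden_drive(e, Y, b_L, bbar_L, kappa=1e-5, eps=1e-8):
    """Final hidden layer: feedback is rescaled by 1/kappa so that pbar - p
    does not vanish in earlier layers; Theta(e)/(e + eps) replaces the (1 - e)
    sigmoid-derivative factor for ReLU units."""
    gate = torch.heaviside(e, torch.zeros_like(e)) / (e + eps)
    u = (b_L @ Y.T) * gate
    ubar = (((bbar_L - b_L) / kappa + b_L) @ Y.T) * gate
    return torch.sigmoid(u), torch.sigmoid(ubar)
```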

B.5.3 Training details

Training on MNIST, CIFAR-10 and ImageNet was done with PyTorch, using GPU nodes running on Compute Canada clusters. For the CIFAR-10 experiments, training examples were presented in batches of 32. Training images were randomly cropped, horizontally flipped and normalized before being presented to the network. Images from the testing and validation sets were simply normalized before being presented. Training was done using stochastic gradient descent, with momentum of 0.9 and a weight decay of $10^{-6}$. When training on ImageNet, a batch size of 128 was used. Training images were randomly resized, cropped, horizontally flipped and normalized. Again, images from the testing and validation sets were only normalized before being presented. Training was done using stochastic gradient descent, with momentum of 0.9 and a weight decay of $10^{-4}$. When training on MNIST and CIFAR-10 using backprop, the standard mean-squared error (MSE) loss was used to update weights throughout the network. When training using burstprop, our weight update rules were chosen to descend the gradient of the MSE loss. The cross-entropy loss was used when training on ImageNet using backprop; again, the burstprop weight update rules also descended the gradient of this loss function.
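A sketch of this training configuration for CIFAR-10 (the batch size, momentum and weight decay are from the text; the normalization statistics, crop padding and learning rate are placeholders of our own):

```python
import torch
import torchvision.transforms as T

# Training transforms: random crop, horizontal flip, normalization.
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])
# Test/validation transforms: normalization only.
test_tf = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])

model = torch.nn.Linear(3 * 32 * 32, 10)       # stand-in for the Table B.3 net
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-6)
loss_fn = torch.nn.MSELoss()                   # MSE loss for MNIST/CIFAR-10
```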

B.5.4 Hyperparameter optimization

For each network architecture and learning rule (backprop, burstprop with fixed feedback weights, or burstprop with learned feedback weights), separate hyperparameter optimization was done. Optimal learning rates, momentum and weight decay values were found using grid search. In the case of burstprop, separate learning rates for the output layer and hidden layers were optimized, due to the added sigmoid nonlinearity in the feedback received by hidden-layer units. Feedforward weights of ReLU layers were initialized from a normal distribution using Kaiming initialization [29]. Xavier initialization was used in sigmoid layers [259]. Finally, the feedforward weights of the output softmax layer in the ImageNet network were drawn from a normal distribution with standard deviation of 0.001. In networks trained using burstprop, feedback weights were drawn from a normal distribution with a standard deviation that was chosen through the hyperparameter optimization process. For all hyperparameter optimization experiments, a randomly chosen subset of the training set was used as a validation set to measure performance. In the MNIST and CIFAR-10 experiments, 10,000 images were used for validation. In the ImageNet experiments, 50,000 images were used.
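A sketch of this initialization scheme (the `is_sigmoid_layer` flag is our own way of marking sigmoid layers, and the fan-in of the final layer is a placeholder):

```python
import torch.nn as nn

def init_weights(module):
    """Kaiming initialization for ReLU layers, Xavier for sigmoid layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        if getattr(module, "is_sigmoid_layer", False):
            nn.init.xavier_normal_(module.weight)   # Xavier for sigmoid layers
        else:
            nn.init.kaiming_normal_(module.weight)  # Kaiming for ReLU layers
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# ImageNet output softmax layer: normal with std 0.001.
softmax_fc = nn.Linear(18816, 1000)
nn.init.normal_(softmax_fc.weight, mean=0.0, std=0.001)
```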

B.6 Supplemental figures

Figure B.1: (Related to Fig. 4.4 of the main text) (a) Comparison of costs for the XOR task. In blue is the cost for the network in Fig. 4.4 of the main text, but with 2000 neurons per population and slightly different parameter values. The dot-dashed pink line is for when the order of the examples is randomly selected within an epoch. The dotted red line has no plasticity in the hidden layer. The dashed green line is for 400 neurons per population. (b-e) Output event rate (ER) after learning. The dashed grey line separates "true (1)" and "false (0)" for the XOR. Only in c is the XOR not solved.

Figure B.2: (Related to Fig. 4.4 of the main text) (a) Comparison of costs when the duration of examples $T$ (in s) (dashed green line) and the moving average time constant $\tau_{avg}$ (in s) (dotted orange line) are changed with respect to the values used in Fig. 4.4 (solid blue). (b) Output event rate (ER) after learning for the three cases in panel a. The dashed grey line separates "true (1)" and "false (0)" for the XOR.

Figure B.3: (Related to Fig. 4.4 of the main text) Output-layer activity for the XOR task (b) when the feedback pathways are symmetric (a). Note that the XOR task is still solved. Only a single realization is displayed here. The symmetric feedback yields very similar representations at the hidden layer (c1-2).

Figure B.4: Learning MNIST with the time-dependent rate model. (a) Schematic of the network. The enlarged hidden-layer population stresses the fact that the burst rate is equal to the event rate times the burst probability, with the event rate and the burst probability nonlinearly integrating the feedforward and feedback signals, respectively. (b) Example event rates (i, iii, v) and weights (ii, iv) for two consecutive examples during the first epoch. In (i), the teacher is illustrated as a dashed line. Learning intervals are indicated by light green vertical bars. (c) Burst probabilities (i, iii) and differences of burst probabilities (ii, iv) for the same examples as in b. This network, with a single hidden layer of 200 units, reached a test error of ∼3% (not shown). Parameters: $\tau_v = \tau_u = 0.1$ s, $\tau_{avg} = 5$ s, $\beta = 5$.


Figure B.5: (Related to Fig. 4.5 of the main text) Network mechanisms regulating the bursting nonlinearity. All panels display the burst probability of a large population of two-compartment pyramidal neurons as a function of the intensity of the injected dendritic current. For each panel, increasing color intensities correspond to increasing values of the injected somatic current. The insets illustrate the microcircuit—including the PV-like neurons (discs) and the SOM-like neurons (inverted triangles)—and the component that is being modified is indicated by a colored circuit element. (a) Increasing the strength of inhibitory synapses from SOM neurons onto the pyramidal neurons' dendrites produces divisive burst probability control. (b) Disinhibiting the pyramidal neurons' dendrites by applying a hyperpolarizing current to the SOM neurons—mimicking inhibition from the VIP neurons—increases the slope. (c) Increasing the probability of release onto SOM neurons produces a small divisive gain modulation. (d) Increasing the dendritic excitability by increasing the strength of the regenerative dendritic activity produces an additive gain control.

Figure B.6: (Related to Fig. 4.5 of the main text) The bursting nonlinearity controls the learning rate. (a) Schematic of the network. Each hidden layer had 500 units. The recurrent weights ($Z^{(1)}$ and $Z^{(2)}$) and the feedback alignment weights ($Y^{(1)}$ and $Y^{(2)}$) are explicitly represented. (b) Angle between the weight updates $\Delta W^{(1)}$ in the standard backpropagation algorithm and in burstprop for the MNIST digit recognition task. The angle is displayed for different values of the slope of the dendritic nonlinearity ($\beta$).

Figure B.7: (Related to Fig. 4.6 of the main text) Linearity of feedback signals degrades with depth in a deep convolutional network trained on ImageNet. Each plot shows the change in burst probability of a unit in hidden layer $l$, $\Delta p_l$, as the burst probability at the output layer, $p_8$, is changed by $\Delta p_8$ (blue, top), as well as a random sample of 2000 burst probabilities after presentation of an input image (red, bottom).

Figure B.8: (Related to Fig. 4.6 of the main text) Learning MNIST with the simplified rate model. A convolutional network whose architecture is described in Table B.3 was trained using backprop, feedback alignment, and burstprop. As in Figure 4.6a & c, recurrent input was introduced at hidden layers to keep burst probabilities linear with respect to feedback signals.

Figure B.9: The variance of the burst probability decreases during learning. (a) Variance of the burst probability as a function of the epoch for the MNIST task, for each layer in a network with 3 hidden layers of 500 units each. (b) Variance of the burst probability as a function of the test error, showing that the magnitude of the variance is correlated with the test error.

Figure B.10: (Related to Fig. 4.3 in the main text) Recurrent short-term facilitating (STF) inhibitory connections within a pyramidal neuron population help disambiguate events and bursts. (a) Without STF inhibition. Left: A steady current is injected into the somata while a steady current of varying intensity is applied to the dendrites (see inset). The burst probability (BP) and the event rate (ER) are plotted. Three different somatic current intensities were tested (lighter curve = lower current intensity). Right: Same as the left-hand side, now with the dendritic current intensity fixed and a varying somatic current intensity. Ideally, the BP should not vary as a function of $I_{soma}$, as multiplexing hinges on the ability to "orthogonalize" the BP and ER responses to somatic and dendritic currents. (b) Same as a, but with STF inhibition. The rectangles emphasize the fact that this inhibition helps achieve a BP that is more independent of the injected somatic current.

Figure B.11: (Related to Fig. 4.4 of the main text) Comparison between direct transmission of events and bursts and transmission mediated by short-term plasticity. (a) Event transmission. A single leaky integrate-and-fire (LIF) neuron receives spikes from two presynaptic pyramidal neurons (blue and yellow symbols). The solid purple curve represents the membrane potential of the LIF neuron when short-term depression filters the presynaptic spike trains to extract events. The solid red curve is the effect of a direct transmission of events, i.e., the synapse processes the presynaptic event-spikes directly. (b) Same as panel a, but for the transmission of bursts with short-term facilitation (STF).

Appendix C

Appendix for Project 3

C.1 LIF neuron simulation details

Post-synaptic spike responses at feedforward synapses, p, were calculated from pre-synaptic binary spikes using an exponential kernel function κ:

$$p_j(t) = \sum_k \kappa(t - \tilde{t}_{jk}) \qquad \text{(C.1)}$$

where $\tilde{t}_{jk}$ is the $k^{\text{th}}$ spike time of input neuron $j$ and $\kappa$ is given by:

$$\kappa(t) = \left(e^{-t/\tau_L} - e^{-t/\tau_S}\right)\Theta(t)/(\tau_L - \tau_S) \qquad \text{(C.2)}$$

where $\tau_S = 0.003$ s and $\tau_L = 0.01$ s represent short and long time constants, and $\Theta$ is the Heaviside step function. Post-synaptic spike responses at feedback synapses, $q$, were computed in the same way.
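A short sketch of Eqs. C.1–C.2 (our own code; the time grid and example spike times are arbitrary):

```python
import numpy as np

tau_S, tau_L = 0.003, 0.010   # short and long time constants (s)

def kappa(t):
    """Difference-of-exponentials kernel with a Heaviside onset (Eq. C.2)."""
    return (np.exp(-t / tau_L) - np.exp(-t / tau_S)) * (t >= 0) / (tau_L - tau_S)

def psp(spike_times, t_grid):
    """p_j(t) = sum_k kappa(t - t_jk), evaluated on a time grid (Eq. C.1)."""
    return sum(kappa(t_grid - tk) for tk in spike_times)

t = np.arange(0.0, 0.2, 1e-4)           # 200 ms at 0.1 ms resolution
p_j = psp([0.010, 0.050, 0.055], t)     # example spikes at 10, 50 and 55 ms
```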

C.2 RDD feedback training implementation

C.2.1 Weight scaling

Weights were shared between the convolutional network and the network of LIF neurons, but feedforward weights in the LIF network were scaled versions of the convolutional network weights:

$$W_{ij}^{\text{LIF}} = \psi\, m\, W_{ij}^{\text{Conv}} / \sigma^2_{W^{\text{Conv}}} \qquad \text{(C.3)}$$

where $W^{\text{Conv}}$ is a feedforward weight matrix in the convolutional network, $W^{\text{LIF}}$ is the corresponding weight matrix in the LIF network, $m$ is the number of units in the upstream layer (i.e. the number of columns in $W^{\text{Conv}}$), $\sigma_{W^{\text{Conv}}}$ is the standard deviation of $W^{\text{Conv}}$, and $\psi$ is a hyperparameter. This rescaling ensures that spike rates in the LIF network stay within an optimal range for the RDD algorithm to converge quickly, even if the scale of the feedforward weights in the convolutional network changes during training. This avoids situations where the scale of feedforward weights is so small that little or no spiking occurs in the LIF neurons.
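As a sketch (our own code; flattening convolutional kernels into a 2-D matrix is assumed), Eq. C.3 can be applied as:

```python
import numpy as np

def scale_for_lif(W_conv, psi):
    """Eq. C.3: W_LIF = psi * m * W_conv / sigma^2, where m is the number of
    upstream units (columns of W_conv) and sigma is the std of W_conv."""
    m = W_conv.shape[1]
    return psi * m * W_conv / np.var(W_conv)

W_conv = 0.05 * np.random.randn(64, 150)   # e.g. 64 output channels, fan-in 150
W_lif = scale_for_lif(W_conv, psi=0.1)     # psi is a tuned hyperparameter
```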


C.2.2 Feedback training paradigm

The RDD feedback training paradigm is implemented as follows. We start by providing driving input to the first layer in the network of LIF neurons. To create this driving input, we choose a subset of 20% of the neurons in that layer, and create a unique input spike train for each of these neurons using a Poisson process with a rate of 200 Hz. All other neurons in the layer receive no driving input. Every 100 ms, a new set of neurons to receive driving input is randomly chosen. After 30 s, this layer stops receiving driving input, and the process repeats for the next layer in the network.
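A sketch of this driving-input schedule for one layer (the percentages, rates and durations are from the text above; the 1 ms simulation grid and layer size are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, dt = 500, 1e-3
steps_per_window = int(0.100 / dt)      # 100 ms per random subset
n_windows = int(30.0 / 0.100)           # 30 s of drive for this layer

for _ in range(n_windows):
    # Choose a new 20% subset of neurons to drive.
    driven = rng.choice(n_neurons, size=n_neurons // 5, replace=False)
    for _ in range(steps_per_window):
        spikes = np.zeros(n_neurons, dtype=bool)
        spikes[driven] = rng.random(driven.size) < 200.0 * dt   # 200 Hz Poisson
        # ...feed `spikes` into the LIF layer and apply the RDD update here...
```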

C.3 Network and training details

The network architectures used to train on Fashion-MNIST, SVHN, CIFAR-10 and VOC are described in Table C.1.

Layer | Fashion-MNIST | SVHN & CIFAR-10 | VOC
Input | 28 × 28 × 1 | 32 × 32 × 3 | 32 × 32 × 3
1 | Conv2D 5 × 5, 64, ReLU | Conv2D 5 × 5, 64, ReLU | Conv2D 5 × 5, 64, ReLU
2 | MaxPool 2 × 2, stride 2 | MaxPool 2 × 2, stride 2 | MaxPool 2 × 2, stride 2
3 | Conv2D 5 × 5, 64, ReLU | Conv2D 5 × 5, 64, ReLU | Conv2D 5 × 5, 64, ReLU
4 | MaxPool 2 × 2, stride 2 | MaxPool 2 × 2, stride 2 | MaxPool 2 × 2, stride 2
5 | FC 384, ReLU | FC 384, ReLU | FC 384, ReLU
6 | FC 192, ReLU | FC 192, ReLU | FC 192, ReLU
7 | FC 10, ReLU | FC 10, ReLU | FC 21, ReLU

Table C.1: Network architectures used to train on Fashion-MNIST, SVHN, CIFAR-10 and VOC.

Inputs were randomly cropped and flipped during training, and batch normalization was used at each layer. Networks were trained using a minibatch size of 32.
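As a rough sketch of how one block of these networks could be written in PyTorch (placing batch normalization before the ReLU and using `padding=2` to preserve spatial size are our assumptions):

```python
import torch.nn as nn

# First block of the Table C.1 networks, with the batch normalization
# mentioned above.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5, padding=2),   # Conv2D 5 x 5, 64
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # MaxPool 2 x 2, stride 2
)
```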

C.4 Akrout et al. algorithm implementation

In experiments that compared sign alignment using our RDD algorithm with the algorithm of [42], we kept the same RDD feedback training paradigm (i.e. layers were sequentially driven, and a small subset of neurons in each layer was active at once). However, rather than updating feedback weights using RDD, we recorded the mean firing rates of the active neurons in the upstream layer, $r^l$, and the mean firing rates in the downstream layer, $r^{l+1}$. We then used the following feedback weight update rule:

$$\Delta Y = \eta\, r^l (r^{l+1})^T - \lambda_{WD} Y \qquad \text{(C.4)}$$

where $Y$ are the feedback weights between layers $l+1$ and $l$, and $\eta$ and $\lambda_{WD}$ are learning rate and weight decay hyperparameters, respectively.
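A minimal sketch of this update (our own code; the array shapes are assumptions):

```python
import numpy as np

def akrout_update(Y, r_l, r_lp1, eta, lam_wd):
    """Eq. C.4: Delta Y = eta * r^l (r^{l+1})^T - lambda_WD * Y, where r_l and
    r_lp1 are mean firing-rate vectors of the driven upstream layer and the
    downstream layer."""
    return eta * np.outer(r_l, r_lp1) - lam_wd * Y

Y = np.zeros((500, 100))                   # feedback weights from layer l+1 to l
r_l, r_lp1 = np.random.rand(500), np.random.rand(100)
Y += akrout_update(Y, r_l, r_lp1, eta=0.01, lam_wd=1e-4)
```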

C.5 Supplemental figures

Figure C.1: Comparison of average spike rates in the fully-connected layers of the LIF network vs. activities of the same layers in the convolutional network, when both sets of layers were fed the same input. Spike rates in the LIF network are largely correlated with activities of units in the convolutional network.