Towards biologically plausible gradient descent
by
Jordan Guerguiev
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Cell and Systems Biology
University of Toronto
© Copyright 2021 by Jordan Guerguiev

Abstract
Towards biologically plausible gradient descent
Jordan Guerguiev
Doctor of Philosophy
Graduate Department of Cell and Systems Biology
University of Toronto
2021
Synaptic plasticity is the primary physiological mechanism underlying learning in the brain. It is dependent on pre- and post-synaptic neuronal activities, and can be mediated by neuromodulatory signals.
However, to date, computational models of learning that are based on pre- and post-synaptic activity and/or global neuromodulatory reward signals for plasticity have not been able to learn complex tasks that animals are capable of. In the machine learning field, neural network models with many layers of computation trained using gradient descent have been highly successful in learning difficult tasks with near-human-level performance. It remains unclear how gradient descent could be implemented in neural circuits with many layers of synaptic connections. The overarching goal of this thesis is to develop theories for how the unique properties of neurons can be leveraged to enable gradient descent in deep circuits and allow them to learn complex tasks.
The work in this thesis is divided into three projects. The first project demonstrates that networks of cortical pyramidal neurons, which have segregated apical dendrites and exhibit bursting behavior driven by dendritic plateau potentials, can in theory leverage these physiological properties to approximate gradient descent through multiple layers of synaptic connections. The second project presents a theory for how ensembles of pyramidal neurons can multiplex sensory and learning signals using bursting and short-term plasticity, in order to approximate gradient descent and learn complex visual recognition tasks that previous biologically inspired models have struggled with. The final project focuses on the fact that machine learning models implementing gradient descent assume symmetric feedforward and feedback weights, and presents a theory for how the spiking properties of neurons can enable them to align feedforward and feedback weights in a network.
As a whole, this work aims to bridge the gap between powerful algorithms developed in the machine learning field and our current understanding of learning in the brain. To this end, we develop novel theories of how neuronal circuits in the brain can coordinate the learning of complex tasks, and present a number of experimental predictions that are fruitful avenues for future experimental research.
Acknowledgements
I would like to extend my deep appreciation to my supervisor, Blake Richards, for putting his trust in me as one of his first Ph.D. students, and for providing me with boundless knowledge, support and encouragement that propelled me through this work. In addition, I am grateful for the help of my collaborators on the work presented here – Timothy Lillicrap, Alexandre Payeur, Friedemann Zenke, Richard Naud and Konrad Kording. I would also like to thank Thomas Mesnard, for his valuable help and advice along the way.
I would also like to thank my lab mates and friends I have made throughout the years, including Matt, Annik, Kirthana, Danny, Colleen and Luke, for sharing this experience with me, and bringing me countless moments of comfort, joy and laughter.
A special thanks to my committee members, Melanie Woodin, Frances Skinner, and Douglas Tweed, for giving me an abundance of valuable advice and suggestions that have helped me improve as a scientist, and shaped this body of work into what it is today.
I would like to thank Mao for her endless love, positivity and encouragement, for which I am forever grateful. Finally, I want to thank my parents, for the many sacrifices they have made to get me to this moment, and my sister, for always supporting me and being my mentor in life.
Contents
1 Introduction
  1.1 Research contributions and thesis outline
2 Background
  2.1 Learning in the brain
  2.2 Biological neurons
    2.2.1 Inhibitory interneurons
    2.2.2 Pyramidal neurons
  2.3 Synaptic plasticity
    2.3.1 Short-term plasticity
    2.3.2 Long-term plasticity
    2.3.3 Hebbian plasticity, neuromodulation and synaptic tagging
    2.3.4 Spike timing dependent plasticity
  2.4 Machine learning
    2.4.1 Artificial neural networks
    2.4.2 Gradient descent
    2.4.3 Backpropagation of error (backprop)
    2.4.4 Convolutional neural networks
  2.5 Weight symmetry
    2.5.1 Feedback alignment
    2.5.2 Kolen-Pollack algorithm
    2.5.3 Weight mirroring
  2.6 Related models of biologically plausible gradient descent
    2.6.1 Contrastive Hebbian learning
    2.6.2 Equilibrium propagation
    2.6.3 Difference target propagation
    2.6.4 Dendritic prediction learning
    2.6.5 Dendritic error backpropagation
    2.6.6 Updated random feedback
    2.6.7 Burst ensemble multiplexing
  2.7 Project synopses
    2.7.1 Project 1: Towards deep learning with segregated dendrites
    2.7.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits
    2.7.3 Project 3: Spike-based causal inference for weight alignment
3 Project 1: Towards deep learning with segregated dendrites
  3.1 Abstract
  3.2 Author contributions
  3.3 Introduction
  3.4 Results
    3.4.1 A network architecture with segregated dendritic compartments
    3.4.2 Calculating credit assignment signals with feedback driven plateau potentials
    3.4.3 Co-ordinating optimization across layers with feedback to apical dendrites
    3.4.4 Deep learning with segregated dendrites
    3.4.5 Coordinated local learning mimics backpropagation of error
    3.4.6 Conditions on feedback weights
    3.4.7 Learning with partial apical attenuation
  3.5 Discussion
  3.6 Methods
    3.6.1 Neuronal dynamics
    3.6.2 Plateau potentials
    3.6.3 Weight updates
    3.6.4 Multiple hidden layers
    3.6.5 Learning rate optimization
    3.6.6 Training paradigm
    3.6.7 Simulation details
  3.7 Acknowledgments
4 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits
  4.1 Abstract
  4.2 Author contributions
  4.3 Introduction
  4.4 Results
    4.4.1 A burst-dependent rule enables top-down steering of plasticity
    4.4.2 Dendrite-dependent bursting combined with short-term plasticity supports multiplexing of feedforward and feedback signals
    4.4.3 Combining a burst-dependent plasticity rule with short-term plasticity and apical dendrites can solve the credit assignment problem
    4.4.4 Burst-dependent plasticity promotes linearity and alignment of feedback
    4.4.5 Ensemble-level burst-dependent plasticity in deep networks can support good performance on standard machine learning benchmarks
  4.5 Discussion
  4.6 Methods
    4.6.1 Spiking model
    4.6.2 Deep network model for categorical learning
  4.7 Acknowledgments
  4.8 Code availability
5 Project 3: Spike-based causal inference for weight alignment
  5.1 Abstract
  5.2 Author contributions
  5.3 Introduction
  5.4 Related work
  5.5 Our contributions
  5.6 Methods
    5.6.1 General approach
    5.6.2 RDD feedback training phase
    5.6.3 LIF dynamics
    5.6.4 RDD algorithm
  5.7 Results
    5.7.1 Alignment of feedback and feedforward weights
    5.7.2 Descending the symmetric alignment cost function
    5.7.3 Performance on Fashion-MNIST, SVHN, CIFAR-10 and VOC
  5.8 Discussion
6 Discussion
  6.1 Challenges and limitations
    6.1.1 Project 1: Towards deep learning with segregated dendrites
    6.1.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits
    6.1.3 Project 3: Spike-based causal inference for weight alignment
  6.2 Experimental predictions
    6.2.1 Project 1: Towards deep learning with segregated dendrites
    6.2.2 Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits
    6.2.3 Project 3: Spike-based causal inference for weight alignment
  6.3 Future directions
    6.3.1 Learning modalities and network architectures
    6.3.2 Teaching and neuromodulatory signals
    6.3.3 Neuron types, connectivity motifs and cortical layers
  6.4 Concluding remarks
Glossary

Bibliography
A Appendix for Project 1
  A.1 Proofs
    A.1.1 Theorem for loss function coordination
    A.1.2 Hidden layer targets
    A.1.3 Lemma for firing rates
  A.2 Supplemental figures
B Appendix for Project 2
  B.1 Backprop
  B.2 Quasi-static burstprop
    B.2.1 Derivation
  B.3 Time-dependent rate model
    B.3.1 Dynamics
    B.3.2 Limiting case
  B.4 Linking the rate-based and spike-based models
    B.4.1 Linking the learning rules
  B.5 Models trained on MNIST, CIFAR-10 and ImageNet
    B.5.1 Model architectures
    B.5.2 Activation functions, burst probabilities and weight update rules
    B.5.3 Training details
    B.5.4 Hyperparameter optimization
  B.6 Supplemental figures
C Appendix for Project 3
  C.1 LIF neuron simulation details
  C.2 RDD feedback training implementation
    C.2.1 Weight scaling
    C.2.2 Feedback training paradigm
  C.3 Network and training details
  C.4 Akrout et al. algorithm implementation
  C.5 Supplemental figures
List of Tables

A.1 List of parameter values used in simulations in Project 1
B.1 Comparison of bio-inspired credit-assignment algorithms
B.2 Summary of the main equations of burstprop and backprop
B.3 Network architectures used to train on MNIST, as well as CIFAR-10 and ImageNet experiments using backprop, node perturbation and burstprop with learned feedback weights, in Figs. 4.6 and B.8
B.4 Network architectures used to train on CIFAR-10 and ImageNet with fixed feedback weights in Fig. 4.6
C.1 Network architectures used in Project 3
List of Figures

1.1 The credit assignment problem in biological circuits involved in visual processing
2.1 Feed-forward artificial neural networks
3.1 The credit assignment problem in multi-layer neural networks
3.2 Potential solutions to credit assignment using top-down feedback
3.3 Illustration of a multi-compartment neural network model for deep learning
3.4 Illustration of network phases for learning
3.5 Co-ordinated errors between the output and hidden layers
3.6 Improvement of learning with hidden layers
3.7 Approximation of backpropagation with local learning rules
3.8 Conditions on feedback synapses for effective learning
3.9 Importance of dendritic segregation for deep learning
3.10 An experiment to test the central prediction of the model
4.1 The credit assignment problem for hierarchical networks
4.2 Burst-dependent plasticity rule
4.3 Dendrite-dependent bursting combined with short-term plasticity supports the simultaneous propagation of bottom-up and top-down signals
4.4 Burst-dependent plasticity can solve the credit assignment problem for the XOR task
4.5 Burst-dependent plasticity of recurrent and feedback connections promotes gradient-based learning by linearizing and aligning feedback
4.6 Ensemble-level burst-dependent plasticity supports learning in deep networks
5.1 Illustration of weight symmetry in a neural network with feedforward and feedback connections
5.2 Outline of the feedback weight learning algorithm
5.3 Feedback weights become aligned with feedforward weights during training
5.4 Evolution of Rself during training
5.5 Results of training on Fashion-MNIST, SVHN, CIFAR-10 and VOC
A.1 Weight alignment during first epoch of training
A.2 Learning with stochastic plateau times
A.3 Importance of weight magnitudes for learning with sparse weights
B.1 Comparison of costs for the XOR task
B.2 Comparison of costs for different example durations and moving average time constants
B.3 Output-layer activity for the XOR task
B.4 Learning MNIST with the time-dependent rate model
B.5 Network mechanisms regulating the bursting nonlinearity
B.6 The bursting nonlinearity controls the learning rate
B.7 Linearity of feedback signals degrades with depth in deep convolutional network trained on ImageNet
B.8 Learning MNIST with the simplified rate model
B.9 The variance of the burst probability decreases during learning
B.10 Recurrent short-term facilitating (STF) inhibitory connections within a pyramidal neuron population help disambiguate events and bursts
B.11 Comparison of average spike rates note: direct transmission of events and bursts vs. transmission mediated by short-term plasticity
C.1 Comparison of average spike rates in the fully-connected layers of the LIF network vs. activities of the same layers in the convolutional network, when both sets of layers were fed the same input
Chapter 1
Introduction
Synaptic plasticity is believed to be a key mechanism underlying learning and memory in the brain [1, 2]. In particular, plasticity in the form of long-term potentiation (LTP) and long-term depression (LTD) is implicated in the learning of behavioral tasks [3, 4, 5, 6, 7]. LTP and LTD are driven by characteristics of pre- and post-synaptic neuronal activities [8, 9, 10] and can also be affected by neuromodulators [11, 7, 12, 13, 14]. Hebbian plasticity is the long-standing theory that LTP arises when a pre-synaptic neuron repeatedly fires before the post-synaptic neuron [15]. A well-studied paradigm of Hebbian plasticity, spike-timing dependent plasticity (STDP), causes LTP or LTD at synapses depending on the timing of pre- and post-synaptic spike trains, and has been characterized for many neuron types and brain regions [16, 17, 18, 19].

Over the past few decades, Hebbian plasticity rules have been successfully used to capture a variety of phenomena observed in the brain [20, 21] and implement a number of unsupervised learning algorithms [22, 23, 24]. In order to account for the powerful learning capabilities observed in mammals, however, Hebbian learning rules must incorporate information about optimizing some objective function [7, 25, 26, 27]. A global neuromodulatory signal is in theory sufficient for optimizing any behavioral task, but learning with such a signal is very slow in deep networks, because a global reward signal alone provides very limited information to neurons earlier in the hierarchy [25, 28, 27]. Simple Hebbian plasticity rules and plasticity rules based on a global reward signal therefore do not appear to be sufficient for solving complex learning tasks; this suggests that upstream neurons in hierarchical circuits must receive top-down, neuron-specific signals that guide their direction of plasticity (i.e., whether to engage in LTP or LTD) in order to reduce their contribution to a global error.
The signals that they receive should therefore reflect their causal effect on this error. This is known as the credit assignment problem – neurons in a hierarchical network should receive a signal of their credit for an error in the output, in order to undergo plasticity that reduces their contribution to this error (Figure 1.1). How the brain might solve the credit assignment problem remains an open question.
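To make concrete why a single global reward signal learns slowly, consider the following minimal sketch (our illustration, not a model from this thesis; all names and constants are assumptions chosen for the demo). It trains one linear unit with a three-factor, reward-modulated Hebbian rule of the node-perturbation family: the weight change is the product of presynaptic activity, the postsynaptic perturbation, and a globally broadcast reward relative to its running baseline.

```python
import numpy as np

# Sketch (illustration only): a reward-modulated Hebbian rule.
# A single linear unit perturbs its output with noise; a global scalar
# reward, compared to a running baseline, gates a Hebbian update.
rng = np.random.default_rng(2)
x = rng.normal(size=8)            # fixed presynaptic input pattern
w = np.zeros(8)                   # synaptic weights to be learned
w_true = rng.normal(size=8)
target = w_true @ x               # desired output for this input
lr, sigma = 0.001, 0.1
r_baseline = 0.0
errors = []
for _ in range(2000):
    noise = sigma * rng.normal()          # postsynaptic perturbation
    y = w @ x + noise
    r = -(y - target) ** 2                # global scalar reward
    errors.append(abs(w @ x - target))
    # Three-factor rule: pre activity * post perturbation * reward signal.
    w += lr * (r - r_baseline) * (noise / sigma**2) * x
    r_baseline += 0.1 * (r - r_baseline)  # running reward baseline
```

This succeeds for one unit, but the same scalar reward would have to be shared by every neuron in a deep network, so each neuron's update becomes mostly noise as the network grows; this scaling problem is what neuron-specific credit signals are meant to solve.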
In the machine learning field, neural networks with many layers of processing units (deep neural networks) have been incredibly successful at learning difficult tasks, in some cases achieving human-level performance [29, 30, 31]. These networks are trained using gradient descent – given a cost function to minimize, weights throughout a neural network are updated in the opposite direction of the gradient of the cost function with respect to the weights. Gradient descent solves the credit assignment problem, since the gradient is by definition equal to the contribution of a weight to the cost function. Interestingly,
Figure 1.1: The credit assignment problem in biological circuits involved in visual processing. Visual sensory input (in this example, a wolf) is transformed through the hierarchy of the ventral stream into a representation in inferior temporal gyrus (IT) for “dog”, leading to the generation of error signals in downstream associative regions. Circles represent neurons, and filled circles indicate active neurons. Arrows between neurons represent synaptic connections. In order for useful learning to occur, upstream neurons need to know their contribution to downstream activity in order to undergo synaptic plasticity that leads to a decrease in future error signals. This is known as the credit assignment problem.
deep networks trained using gradient descent learn similar representations of stimuli to those observed in the neocortex [32, 33, 34, 35, 36]. This suggests that learning in the cortex involves something akin to gradient descent. However, a biological implementation of gradient descent is not straightforward, for numerous reasons [37, 38]. For one, the gradient terms that need to be calculated at a layer in the network hierarchy involve all of the downstream feedforward weights in the network. The most common gradient descent algorithm in machine learning, backpropagation of error, or backprop, uses the exact values of downstream weights to calculate the weight updates throughout the network. In a biological network, this would require neurons to receive information on all of the synaptic connections downstream, which is highly implausible given our current understanding of neuronal physiology. The problem of communicating downstream feedforward weights to upstream neurons in a network is known as the weight transport problem.

An alternative method to communicate downstream feedforward weights is to transmit error signals using a separate set of feedback connections with weights symmetric to the feedforward ones. Several models have addressed the problem of biologically plausible gradient descent using this approach [39, 40, 41]. However, this raises the non-trivial question of how symmetry between feedforward and feedback synapses can be achieved. In addition, some of these models imply a separate pathway for communicating error signals, in order to separate feedforward sensory signals from error signals used for plasticity [39, 42]. This would require each neuron in a network to have a one-to-one pairing with a feedback neuron. While possible, there is no evidence for this type of one-to-one pairing occurring in biological circuits.
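The weight transport problem, and the feedback alignment workaround discussed in Section 2.5.1, can be illustrated with a small sketch (our illustration; the linear units, layer sizes and learning rate are arbitrary assumptions). Exact backprop would propagate the output error back through the transpose of the output weights; here a fixed random matrix B carries the error instead, so no neuron ever needs to "know" the downstream feedforward weights.

```python
import numpy as np

# Sketch (illustration only): a two-layer linear network trained with
# fixed random feedback weights instead of the transposed feedforward
# weights (the feedback alignment scheme).
rng = np.random.default_rng(1)
n_in, n_hid, n_out, n_samp = 10, 20, 5, 50
X = rng.normal(size=(n_samp, n_in))
T = rng.normal(size=(n_samp, n_out))        # arbitrary targets
W1 = rng.normal(size=(n_in, n_hid)) * 0.5
W2 = rng.normal(size=(n_hid, n_out)) * 0.5
B = rng.normal(size=(n_out, n_hid)) * 0.1   # fixed random feedback weights
lr = 0.01
losses = []
for _ in range(500):
    H = X @ W1                  # hidden layer (linear, for simplicity)
    Y = H @ W2                  # output layer
    E = Y - T                   # output error
    losses.append(0.5 * np.mean(E ** 2))
    dH = E @ B                  # error sent back through B, not W2.T
    W2 -= lr * H.T @ E / n_samp
    W1 -= lr * X.T @ dH / n_samp
```

Although B is fixed and random, the loss still falls; in the feedback alignment regime the feedforward weights tend to align with the fixed feedback pathway over training, so the random feedback comes to carry useful gradient information.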
Finally, the communication of signed error signals using binary spiking neurons is not straightforward, and any theory for biologically plausible gradient descent should account for this.

The overall aim of the work presented here is to develop computational theories that explore how the brain might solve these issues, and to present falsifiable predictions about neural physiology and circuitry that are fruitful avenues for future experimental research.
1.1 Research contributions and thesis outline
This thesis explores how gradient descent can be approximated in a biologically plausible manner in deep hierarchical networks, while addressing (1) the issue of symmetric feedback connections, (2) the need for a separate pathway to communicate error signals, and (3) how signed error signals can be communicated. The work presented in this thesis contributes to the growing body of research on biologically plausible gradient descent by developing novel models for how the brain can coordinate plasticity in deep hierarchical networks in a way that approximates gradient descent and enables learning of complex tasks. In addition, this work presents experimental predictions that can be investigated in future work.
The thesis is divided into three projects, each of which corresponds to a research paper that is either published or accepted for publication at a journal or conference:
• Project 1: Towards deep learning with segregated dendrites (published in eLife [43])
• Project 2: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits (accepted for publication in Nature Neuroscience, available as a preprint [44])
• Project 3: Spike-based causal inference for weight alignment (published as a conference paper at ICLR [45])
The next chapter provides a background overview of the relevant neuroanatomical and machine learning concepts that inform this work, as well as other related models of biologically plausible gradient descent. In addition, a brief synopsis of each project is provided.
Chapter 3 describes Project 1, which presents a model that leverages properties of pyramidal cortical neurons, namely segregated apical dendrites and dendritic plateau potentials, to approximate gradient descent in a biologically plausible manner. This model uses reciprocal feedback connections between neurons with fixed, random weights, taking advantage of the feedback alignment effect [46] to overcome the weight transport problem. Segregated apical dendrites enable feedforward and feedback signals to be communicated without requiring a separate feedback pathway, and signed error signals are communicated using the difference in dendritic plateau potentials.
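As a rough schematic of this idea (a deliberate simplification, not the Chapter 3 model itself, which uses compartmental voltage dynamics, distinct forward and target phases, and derivative factors omitted here), hidden weights can be updated with the difference between a target-phase and a forward-phase plateau signal delivered through fixed random feedback weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Schematic sketch (illustration only; sizes and constants are
# assumptions). A hidden layer compares two apical "plateau" signals:
# one driven by the actual output (forward phase), one driven by the
# output nudged toward the target (target phase). Their difference is a
# neuron-specific, signed error signal for the feedforward weights.
rng = np.random.default_rng(3)
n_in, n_hid, n_out = 5, 8, 3
W0 = rng.normal(size=(n_in, n_hid)) * 0.5   # input -> hidden
W1 = rng.normal(size=(n_hid, n_out)) * 0.5  # hidden -> output
B = rng.normal(size=(n_out, n_hid)) * 0.5   # fixed random feedback
x = rng.normal(size=n_in)
target = rng.uniform(0.2, 0.8, size=n_out)
lr, nudge = 0.5, 0.2
losses = []
for _ in range(500):
    h = sigmoid(x @ W0)
    y = sigmoid(h @ W1)
    losses.append(0.5 * np.sum((target - y) ** 2))
    p_f = sigmoid(y @ B)                  # forward-phase plateau
    y_t = y + nudge * (target - y)        # output nudged toward target
    p_t = sigmoid(y_t @ B)                # target-phase plateau
    W1 += lr * np.outer(h, (target - y) * y * (1 - y))
    W0 += lr * np.outer(x, (p_t - p_f))   # plateau-difference update
```

The difference p_t - p_f vanishes once the output matches the target, so it acts as a signed, per-neuron error signal even though no neuron ever reads the downstream feedforward weights.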
Chapter 4 describes Project 2, which demonstrates that ensembles of pyramidal neurons can encode both sensory feedforward information and feedback error signals concurrently using bursting, and that this multiplexing can enable networks of ensembles to approximate gradient descent and solve difficult visual recognition tasks. A learning rule for feedback weights enables them to learn to become symmetric with feedforward weights, and signed error signals are communicated using the temporal difference in ensemble burst rates.
Note that the computational work described in Chapter 4 is broadly divided into two parts: a spiking simulation model (presented in Sections 4.4.1-4.4.4 and 4.6.1), which was developed by Alexandre Payeur, and a rate-based model (Sections 4.4.4, 4.4.5 and 4.6.2), which was done by Jordan Guerguiev and is the focus of this thesis.
Chapter 5 describes Project 3, which focuses on a learning rule for feedback weights that allows them to become symmetric with feedforward weights by taking advantage of the spiking properties of neurons. This is accomplished using a technique called regression discontinuity design (RDD), which allows a neuron to estimate its causal effect on the activities of downstream neurons by taking advantage of the binary spiking threshold.
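The logic of regression discontinuity design can be conveyed with a toy sketch (our illustration; the thesis's RDD algorithm fits piecewise-linear models to the neuron's drive rather than taking the simple window averages used here). A neuron spikes when its drive exceeds a threshold; a downstream signal depends causally on the spike but is also confounded by the shared drive.

```python
import numpy as np

# Sketch (illustration only): regression discontinuity at the spike
# threshold. The naive spike/no-spike comparison is confounded, because
# the drive u influences y through a second pathway; comparing only
# trials just above vs. just below threshold isolates the causal effect.
rng = np.random.default_rng(4)
n = 200_000
u = rng.normal(size=n)                 # input drive of the neuron
theta = 0.0                            # spiking threshold
spike = (u > theta).astype(float)
causal_effect = 1.5                    # true effect of a spike on y
y = causal_effect * spike + 2.0 * u + rng.normal(size=n)

# Naive estimate: mean y with a spike minus mean y without (biased).
naive = y[spike == 1].mean() - y[spike == 0].mean()

# RDD estimate: narrow windows on either side of the threshold.
win = 0.05
rdd = (y[(u > theta) & (u < theta + win)].mean()
       - y[(u < theta) & (u > theta - win)].mean())
```

Here the naive estimate is badly inflated by the confound, while the discontinuity estimate recovers something close to the true causal effect; the binary spiking threshold is exactly what makes this comparison possible.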
Lastly, Chapter 6 discusses the overall conclusions, limitations and experimental predictions of the work presented in this thesis, as well as future research directions.

Chapter 2

Background
2.1 Learning in the brain
The ability of an animal to learn, i.e., to acquire knowledge about the external world and use this knowledge to adapt its behavior, is enabled by complex structures in the brain. In the neuroscience field, extensive attention has been given to studying the neural correlates of learning. The neocortex plays a critical role in enabling the powerful cognitive abilities of mammals, such as visual and auditory perception, motor control and language. Through experience with the external world, changes in neural activities in cortical regions enable mammals to acquire new cognitive abilities, such as successfully navigating a maze or learning to play the piano. What specific changes in neural activity enable the learning of new skills, and how these changes occur, remain areas of active experimental and theoretical research. What follows is a brief outline of the biological and theoretical knowledge that informs the work presented in this thesis, with a focus on cortical neurons.
2.2 Biological neurons
Neurons in the brain can be broadly classified into two types: excitatory and inhibitory. Excitatory neurons have a depolarizing effect on post-synaptic neurons through the release of excitatory neurotransmitters such as glutamate as a result of an action potential, while inhibitory neurons cause post-synaptic hyperpolarization by releasing inhibitory neurotransmitters like gamma-aminobutyric acid (GABA). Within each of these two broad classifications, a large variety of neurons can be found, in terms of morphology, physiology, connectivity and function.
2.2.1 Inhibitory interneurons
Inhibitory neurons comprise about 10-20% of the cortical neuronal population [47]. While they make up a relatively small subset of neocortical neurons, their axons project diffusely onto large numbers of neurons [48, 49, 50], and they play an important role in sensory processing, motor control and cognition [51]. In addition, inhibitory neurons are essential for maintaining balanced cortical activity [14].
Almost all neocortical inhibitory neurons are interneurons (neurons whose axons are limited to within a single brain area) [50]. These interneurons can be placed in three broad classes based on protein expression: parvalbumin-expressing (PV+) interneurons, somatostatin-expressing (SST+) interneurons, and serotonin receptor 5HT3a-expressing (5HT3aR+) interneurons [50]. 5HT3aR+ interneurons are further broken down into those that express vasoactive intestinal peptide (VIP), and those that do not.
Recent studies of these interneuron classes have found a set of consistent connectivity motifs throughout major neocortical areas [52, 50]. PV+ interneurons are mainly comprised of fast-spiking basket cells and chandelier cells, and mainly project onto perisomatic and axonal regions of pyramidal neurons. SST+ interneurons are mostly comprised of Martinotti cells, which primarily project to apical dendrites of pyramidal neurons. Finally, VIP+ cells are bipolar or multipolar cells that provide disinhibition of excitatory cells by inhibiting PV+ and SST+ interneurons. Cortical inhibitory interneurons receive local inputs from pyramidal neurons and other interneurons, long-range thalamic inputs, as well as inputs from other cortical regions [50, 52].
2.2.2 Pyramidal neurons
Pyramidal neurons are the most abundant type of excitatory neuron, found in the brains of all mammals, birds, fish and reptiles [53]. They are characterized by having two distinct dendritic domains, the basal and apical dendrites, as well as the pyramidal shape of the soma. In the neocortex, they are found in all layers except layer 1, in which apical dendritic tufts of pyramidal neurons in other layers are found. There are also differences in the inputs and outputs of pyramidal neurons found in different cortical layers. Layer 4 pyramidal neurons receive feedforward sensory input from thalamic projections, and project locally to layer 2/3, and to some extent layer 5 [53]. Layer 2/3 pyramidal neurons project locally to layer 5 neurons as well as to other cortical regions. Pyramidal neurons in layer 5 typically have long-range connections to other cortical regions, as well as to subcortical areas [54, 53]. Finally, layer 6 neurons receive direct thalamic input and provide long-range feedback to primary thalamic nuclei and intracortical projections to layer 4 [55, 56, 57]. Axons carrying feedback signals from higher order cortical areas project principally to layer 1 apical dendritic tufts, but also to layers 5 and 6 [58, 59].
Interestingly, pyramidal neurons have several unique properties that lead to complex behaviors, outlined below. As presented in this work, these properties can theoretically enable networks of pyramidal neurons to perform gradient descent.
Segregated dendritic compartments, action potentials and bursting
Pyramidal neurons in layer 5 of cortex, as well as in layer 2/3 and in hippocampus, have two distinct sets of dendrites: shorter basal dendrites near the soma, and longer apical dendrites that are electrotonically distant from the soma [60, 61, 62]. Several studies support the notion that basal dendrites of layer 5 pyramidal neurons typically receive bottom-up sensory input, while apical dendrites generally receive top-down feedback signals from higher-order processing areas [63, 64, 65, 53]. Patch-clamp studies have identified that apical dendrites of pyramidal neurons in layer 5 and in CA1 of hippocampus can generate supralinear dendritic Ca2+ spikes (also known as calcium plateau potentials) at an initiation zone near the main apical bifurcation [66, 61, 60]. These spikes can cause sufficient depolarization at the axonal action potential (AP) initiation zone to generate an action potential. Ca2+ spikes can be generated by coincident apical dendritic input and back-propagating APs [60, 61]. In addition, input at thin apical tuft dendrites can trigger N-methyl-D-aspartate (NMDA) spikes, which can in turn generate Ca2+ spikes and action potentials [61].
Notably, a back-propagating AP can initiate an additional Ca2+ spike, given sufficient apical input [61, 63]. The interaction between Ca2+ spikes and back-propagating APs leads to a series of high-frequency APs being generated at the soma, known as bursts. Ca2+ spikes and AP bursts have been implicated in the induction of long-term synaptic plasticity in pyramidal neurons [67, 68, 69].
2.3 Synaptic plasticity
Synaptic plasticity refers to changes in synaptic strength between neurons in the brain. Synaptic strength refers to the relative size of post-synaptic depolarizing or hyperpolarizing current that is generated by a pre-synaptic spike, and is dependent on many factors, including the volume of neurotransmitter released by the pre-synaptic cell and the types, states and amounts of post-synaptic receptors at the synapse. Synaptic plasticity has been observed in virtually every type of synapse in both invertebrates and mammals and is believed to be a key physiological substrate for learning and memory. There are two main types of synaptic plasticity: short-term plasticity, which refers to temporary presynaptic activity-dependent changes in synaptic strength that last on the order of tens of milliseconds to a few minutes, and long-term plasticity, which operates on much longer timescales (hours or more).
2.3.1 Short-term plasticity
Short-term plasticity (STP) includes short-term facilitation (STF) and short-term depression (STD). Because of their transient nature, STF and STD are not believed to be directly involved in learning and memory, but rather to act as adaptive mechanisms. STP usually occurs as a result of pre-synaptic mechanisms [70]. STD is most commonly characterized by a depletion of the readily-releasable pool (RRP) of synaptic vesicles and subsequent drop in neurotransmitter release, but can also arise from saturation or desensitization of post-synaptic receptors [70, 71]. The mechanisms underlying STF are less well-characterized, although various theories have been proposed, including increased neurotransmitter release due to residual Ca2+ in the pre-synaptic bouton, or the binding of Ca2+ to calcium buffers such as calbindin during the initial action potential, leading to an overall increase of Ca2+ reaching the neurotransmitter release sites with subsequent APs [70].
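A common phenomenological description of these two mechanisms is the Tsodyks-Markram model; the sketch below (our illustration; the parameter values are assumptions, and the STP model used elsewhere in this thesis may differ in detail) tracks a depression variable R, the fraction of available vesicles in the RRP, and a facilitation variable u, the release probability transiently boosted by residual Ca2+.

```python
import numpy as np

# Sketch of the Tsodyks-Markram model of short-term plasticity
# (illustration only; parameters are assumed, not fit to data).
def tm_efficacies(spike_times, U, tau_d=0.2, tau_f=0.6):
    """Relative synaptic efficacy (u * R) at each presynaptic spike."""
    R, u, last_t = 1.0, U, None
    out = []
    for t in spike_times:
        if last_t is not None:
            dt = t - last_t
            R = 1.0 - (1.0 - R) * np.exp(-dt / tau_d)  # vesicle recovery
            u = U + (u - U) * np.exp(-dt / tau_f)      # residual Ca2+ decay
        u = u + U * (1.0 - u)   # facilitation: u jumps at each spike
        out.append(u * R)
        R = R * (1.0 - u)       # depression: resources are consumed
        last_t = t
    return out

train = [i * 0.05 for i in range(10)]           # 20 Hz presynaptic train
eff_facilitating = tm_efficacies(train, U=0.1)  # low release prob -> STF
eff_depressing = tm_efficacies(train, U=0.7)    # high release prob -> STD
```

Whether a synapse facilitates or depresses falls out of a single mechanism: with a low baseline release probability U, the jump in u dominates early in a spike train, whereas a high U depletes R faster than u can grow.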
2.3.2 Long-term plasticity
Long-term plasticity, in the form of long-term potentiation (LTP) and long-term depression (LTD), involves persistent changes in the synaptic strength at a chemical synapse. This type of plasticity is considered to be the main mechanism for information storage and learning in the brain [1], and occurs at both excitatory and inhibitory synapses [72].
Most research of LTP has focused on the hippocampus, due to its importance for memory formation, but an additional body of work has taken a closer look at LTP in other cortical areas such as motor, auditory and visual cortex [69, 73]. LTP is mostly, if not exclusively, a post-synaptic phenomenon. At excitatory synapses, a widely studied form of LTP is mediated by the activation of NMDA ionotropic glutamate receptors (NMDARs) [9, 69, 74]. Influx of Ca2+ through NMDARs activates CaMKII proteins, which trigger phosphorylation of AMPARs and upregulate the trafficking of AMPARs to the membrane [9]. Likewise, LTD triggered by activation of NMDA receptors leads to dephosphorylation and removal of AMPA receptors from the synapse. Both NMDAR-mediated LTP and LTD are dependent on activation of both the pre- and post-synaptic cell, as NMDARs are activated only when glutamate is bound and there is sufficient depolarization to expel their magnesium block. Various forms of LTP and LTD have also been identified at inhibitory synapses [75, 76].
2.3.3 Hebbian plasticity, neuromodulation and synaptic tagging
A long-standing postulate in theoretical neuroscience, known as Hebb’s postulate, states that long-term synaptic plasticity is dependent on the co-occurrence of pre- and post-synaptic activity (“cells that fire together, wire together”). Experimental protocols have demonstrated that LTP and LTD can be induced by pairing activation of the pre-synaptic neuron with either activation or depolarization of the post-synaptic neuron [3]. Thus, a general Hebbian plasticity rule has the form:
\frac{dw_{ij}}{dt} = s_i(t) a_j(t) \quad (2.1)
where w_{ij} is the strength of the synapse between pre-synaptic neuron j and post-synaptic neuron i, s_i is the state of the post-synaptic neuron (its membrane potential, firing rate, spike times, etc.), and a_j is the spiking activity of the pre-synaptic neuron. A wealth of experimental evidence supports Hebb’s postulate – for example, both NMDAR-mediated LTP and LTD are dependent on coincident pre- and post-synaptic activity. Hebbian plasticity rules have been demonstrated to generate features seen in the cortex, such as the ocular dominance and orientation columns of visual cortex [20], and are capable of solving unsupervised learning tasks. However, the simple two-factor Hebbian learning rule (where the two factors are the pre-synaptic and post-synaptic activities) is not sufficient for learning a behavioral objective, because it does not contain any term that reflects the direction of weight change that would reduce a learning-related error [27, 77]. To account for this, weight update rules incorporating a third factor, reflecting some form of neuromodulatory signal that gates plasticity, have been proposed. These plasticity rules have the form:
\frac{dw_{ij}}{dt} = s_i(t) a_j(t) M(t) \quad (2.2)

where M(t) is a neuromodulatory signal, representing reward, the difference between received and expected reward, or the saliency of an event. Experimentally, relationships between neuromodulatory activity and the learning of behaviors have been identified, such as the importance of dopamine (DA) neurons and DA receptor activation for the learning of reward-motivated behavior [78, 79].
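The two- and three-factor rules above can be illustrated numerically. The following is a minimal rate-based sketch, assuming simple Euler integration; all variable names, sizes and values are illustrative and not taken from the thesis:

```python
import numpy as np

# Minimal rate-based sketch of the Hebbian rules in Eqs. 2.1-2.2, using
# Euler integration with step dt. Sizes and values are illustrative.
def hebbian_step(w, s_post, a_pre, dt=0.01, M=None):
    """Two-factor rule: dw_ij/dt = s_i * a_j.
    With a neuromodulatory signal M, this becomes the three-factor rule
    dw_ij/dt = s_i * a_j * M."""
    dw = np.outer(s_post, a_pre)
    if M is not None:
        dw = dw * M
    return w + dt * dw

w = np.zeros((2, 3))
s = np.array([1.0, 0.0])           # post-synaptic activities s_i
a = np.array([1.0, 1.0, 0.0])      # pre-synaptic activities a_j
w2 = hebbian_step(w, s, a)         # two-factor: co-active pairs strengthen
w3 = hebbian_step(w, s, a, M=0.0)  # three-factor: no modulator, no change
```

The three-factor variant leaves all weights untouched when M(t) = 0, which is exactly the gating role attributed to neuromodulators in the text.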
Another issue with two-factor correlative Hebbian plasticity rules is that the timescale of behavior and reward is much longer than that of purely Hebbian synaptic plasticity mechanisms [7]. The theory of synaptic tagging and capture (STC) addresses this issue, and recent experiments have found evidence supporting this theory in the hippocampus, cortex and striatum [73, 80, 81]. The STC theory posits that Hebbian mechanisms create a synaptic tag at synapses that decays on a behavioural timescale (seconds). However, the induction of this tag does not trigger plasticity in the form of LTP or LTD unless a global neuromodulatory signal is present, acting as a third factor. This process can be formalized using the following equations:
\frac{de_{ij}}{dt} = s_i(t) a_j(t) - \frac{e_{ij}}{\tau_e} \quad (2.3)

\frac{dw_{ij}}{dt} = e_{ij}(t) M(t) \quad (2.4)
where e_{ij} is the synaptic tag at the synapse between pre-synaptic neuron j and post-synaptic neuron i, which decays with time constant τ_e, and M(t) is a modulatory signal that could be determined by neuromodulators such as dopamine, serotonin, acetylcholine or noradrenaline. Note that M(t) is not neuron-specific, since most neuromodulator-releasing neurons project diffusely onto a large number of neurons [7]. STC provides a compelling theory for how reward-based learning may be implemented in the brain. Synaptic tags are analogous to eligibility traces in reinforcement learning models in machine learning, which are powerful models capable of reward-based learning in a number of difficult tasks, in some cases surpassing human performance [82, 31].
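The tag-and-capture dynamics of Eqs. 2.3–2.4 can be sketched in a few lines, again assuming Euler integration with illustrative sizes and values:

```python
import numpy as np

# Sketch of the synaptic tagging and capture rule (Eqs. 2.3-2.4): the tag
# e_ij decays with time constant tau_e, and a weight change occurs only
# while the neuromodulatory signal M is nonzero.
def stc_step(w, e, s_post, a_pre, M, dt=0.01, tau_e=1.0):
    e = e + dt * (np.outer(s_post, a_pre) - e / tau_e)
    w = w + dt * e * M
    return w, e

w = np.zeros((2, 2))
e = np.zeros((2, 2))
s = np.array([1.0, 0.0])
a = np.array([1.0, 0.0])
# Coincident pre/post activity creates a tag but no weight change (M = 0)...
w, e = stc_step(w, e, s, a, M=0.0)
w_before_reward = w.copy()
# ...and a later neuromodulatory signal converts the tag into plasticity,
# even though pre- and post-synaptic activity have ceased.
w, e = stc_step(w, e, np.zeros(2), np.zeros(2), M=1.0)
```

This captures the key temporal dissociation in STC: the Hebbian coincidence and the neuromodulatory signal can occur at different times, bridged by the decaying tag.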
2.3.4 Spike timing dependent plasticity
The induction of LTP or LTD at a synapse, depending on the timing of pre- and post-synaptic spikes, is a well-studied form of Hebbian plasticity called spike-timing dependent plasticity (STDP). STDP has been observed in both excitatory and inhibitory synapses in many different neuron types throughout the brain [83, 16, 84, 17]. In a typical STDP protocol, pre-synaptic spikes are repeatedly paired with post-synaptic spikes with a fixed pre/post spiking interval. The spike interval can be positive, when the pre-synaptic neuron spikes before the post-synaptic neuron, or negative if the post-synaptic neuron spikes first. The change in synaptic strength following the induction protocol is then measured, and the protocol is repeated with a different pre/post interval. The window of pre/post timing that causes LTP or LTD can be very different depending on the brain region, and on whether the pre- and post-synaptic neurons are excitatory or inhibitory [16].
Distinct mechanisms for STDP induction have been characterized at excitatory and inhibitory synapses. At excitatory synapses, STDP depends on both NMDAR activation and a rise in post-synaptic Ca2+ level [85, 16], and has also been found to depend on pre-synaptic factors [86]. At GABAergic synapses, on the other hand, STDP appears to depend only on post-synaptic factors, and has been linked to influx of Ca2+ through L- and T-type voltage-gated Ca2+ channels and subsequent changes in the K+-Cl- co-transporter KCC2 state leading to a change in GABA reversal potential [87, 84, 88].
The traditional STDP protocol is a form of purely Hebbian plasticity, but recent work suggests that neuromodulatory signals can either gate STDP plasticity induction entirely or modify the STDP window [11, 7, 12, 13, 14], consistent with three-factor plasticity rules.
2.4 Machine learning
Machine learning is a branch of statistics and computer science that studies computational algorithms for learning patterns in data and making accurate predictions or decisions, without being explicitly programmed to perform the task. In recent years, machine learning models have been extremely successful at solving complex tasks related to visual processing [89, 90, 91], speech recognition [92], natural language processing [93, 94, 95], and control [31, 96, 97]. These models leverage large amounts of training data related to the task at hand and fast processing power enabled by graphical processing units (GPUs). Machine learning models have successfully learned to perform image recognition, speech recognition, object detection, image generation, sentiment analysis, playing video and board games, driving autonomous vehicles, and many other tasks. Machine learning algorithms are broadly classified into three families: supervised learning (where a model is trained to perform a classification or regression task using an explicit teaching signal), unsupervised learning (where a model does not receive any teaching signal and is tasked with learning meaningful structure in the data, such as a lower-dimensional representation of its features), and reinforcement learning (where a model receives a global reward or punishment signal and is tasked with learning a strategy for maximizing the received reward). The scope of this thesis includes only supervised learning algorithms, although the learning rules presented should be extendable to reinforcement learning and unsupervised learning domains.
2.4.1 Artificial neural networks
Artificial neural networks (ANNs) are a type of machine learning model that consists of a hierarchy of layers of computational units (Figure 2.1A). Each unit integrates its weighted inputs and applies an activation function in order to produce a scalar output. Neural networks can be feedforward networks, where units in layer l provide input only to units in downstream layer l + 1, or recurrent networks, in which units in the same layer can provide input to each other or a unit can provide input to itself. Typically, feedforward networks are used to learn classification or regression tasks involving static inputs (such as images), while recurrent networks are well-suited to learning functions of sequential inputs (such as text documents or videos).
Non-linear activation functions are typically used in ANNs, such as the logistic (or sigmoid), hyperbolic tangent, rectified linear unit (ReLU), or softmax functions, which allow them to approximate complex non-linear functions. ANNs with these activation functions are universal approximators, meaning they can approximate almost any continuous function [98]. This property, combined with powerful learning algorithms like backpropagation of error, has enabled their success at a wide variety of difficult learning tasks.
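The layered computation described above can be made concrete in a short sketch; the ReLU activation and the layer sizes here are illustrative choices:

```python
import numpy as np

# Sketch of a feedforward ANN: each unit integrates its weighted inputs
# and applies a non-linear activation function (ReLU in this sketch).
def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights):
    """Propagate activity through a stack of fully-connected layers."""
    y = x
    for W in weights:
        y = relu(W @ y)
    return y

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)),   # input (4 units) -> hidden (5 units)
           rng.standard_normal((2, 5))]   # hidden (5 units) -> output (2 units)
y = forward(rng.standard_normal(4), weights)
```

Each matrix-vector product implements the weighted integration performed by a layer of units, and the elementwise nonlinearity is what gives the network its expressive power.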
2.4.2 Gradient descent
Training of ANNs involves performing gradient descent on a cost function, L. In a supervised setting, this cost function is a measure of the error between the output of the network and the desired output, or target. For example, given an input x, the mean squared error (MSE) cost function measures the squared difference between the output and target:
L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \|\hat{y}_i - y_i\|^2 \quad (2.5)
where n is the number of input samples, and y_i and ŷ_i are the output and target output for input sample i, respectively.

Figure 2.1: A. A typical fully-connected feedforward artificial neural network, containing an input layer, a series of hidden layers, and an output layer. Circles denote units in each layer. Filled circles represent active units. Arrows represent feedforward connections between units. The output layer activity is used to compute the cost function L, which is the objective function that is minimized using gradient descent. B. A convolutional neural network utilizes convolutional layers. Each unit in these layers receives input only from a subset of the units in the previous layer (in this example, units in the first convolutional layer receive input from a subset of pixels in the input image). Dashed lines indicate units’ receptive fields.

The goal of training is to change the parameters in the network in order to minimize this cost function. When non-linear activation functions are used, the minimum cannot be found analytically; instead, the parameters in the network are updated iteratively, with each update chosen to decrease the value of the cost function. In ANNs, the parameters that are updated are the weights between units. Weights are updated using gradient descent, which updates every weight w in the direction opposite to the gradient of the cost function with respect to that weight:
\Delta w \propto -\frac{\partial L}{\partial w} \quad (2.6)

For a sufficiently small step size, gradient descent is guaranteed to update weights in a way that reduces the cost function L.
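Gradient descent on the MSE cost can be sketched for a single linear unit, where the gradient in Eq. 2.6 has a closed form; the data, learning rate and ground-truth weights below are illustrative:

```python
import numpy as np

# Sketch of gradient descent (Eq. 2.6) on the MSE cost (Eq. 2.5) for a
# single linear unit.
def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def gd_step(w, X, y, lr=0.1):
    grad = 2.0 * X.T @ (X @ w - y) / len(y)  # dL/dw for the MSE cost
    return w - lr * grad                     # step against the gradient

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y_target = X @ np.array([1.0, -2.0, 0.5])    # data generated by known weights
w = np.zeros(3)
cost_before = mse(w, X, y_target)
for _ in range(200):
    w = gd_step(w, X, y_target)
cost_after = mse(w, X, y_target)
```

Because the step size is small relative to the curvature of this quadratic cost, every update decreases the cost, and the weights converge toward the generating weights.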
2.4.3 Backpropagation of error (backprop)
The most common algorithm used to train neural networks using gradient descent is backpropagation of error, or backprop. Backprop takes advantage of the chain rule property of derivatives to optimize the computation of weight updates by avoiding redundant calculations. Weight updates are computed starting from the output layer of the network and iterating backward through the network. The previously computed derivatives with respect to the units in layer l + 1 are used to compute the derivatives for layer l:
\delta_i^l \equiv \frac{\partial L}{\partial y_i^l} = \sum_j \delta_j^{l+1} \frac{\partial y_j^{l+1}}{\partial y_i^l} \quad (2.7)
where y_i^l and δ_i^l are the output and the error derivative of unit i in layer l, respectively. In a fully-connected feedforward neural network with linear activation functions, this is equivalent to:
\delta_i^l = \sum_j w_{ij}^l \delta_j^{l+1} \quad (2.8)
where w_{ij}^l is the feedforward weight between unit i in layer l and unit j in layer l + 1.
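The backward recursion of Eq. 2.8 can be sketched directly for the linear case; the small hand-picked weight matrices below are illustrative:

```python
import numpy as np

# Sketch of the backward recursion in Eq. 2.8 for a linear network:
# each layer's error derivatives are obtained from the layer above.
# Here weights[l] has shape (units in layer l, units in layer l+1),
# matching the indexing w^l_ij in the text.
def backprop_deltas(delta_output, weights):
    deltas = [delta_output]
    for W in reversed(weights):
        # delta^l_i = sum_j w^l_ij * delta^{l+1}_j
        deltas.insert(0, W @ deltas[0])
    return deltas

weights = [np.array([[1.0, 0.0],
                     [0.0, 2.0],
                     [1.0, 1.0]]),   # layer 1 (3 units) -> layer 2 (2 units)
           np.array([[0.5],
                     [0.25]])]       # layer 2 (2 units) -> output (1 unit)
deltas = backprop_deltas(np.array([1.0]), weights)
```

Note that each layer's deltas are computed once and reused, which is the redundancy-avoiding property of backprop described above.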
2.4.4 Convolutional neural networks
Rather than using fully-connected layers throughout the network, convolutional neural networks (CNNs) introduce convolutional layers (Figure 2.1B). These layers perform convolution over activities in the previous layer, rather than weighted integration. Each unit in a convolutional layer receives input only from a subset of the units in the previous layer. The development of convolutional layers was inspired by neurons in the visual processing regions of the brain, which exhibit retinotopically organized receptive fields. The receptive field of a neuron in the visual system is the subset of the visual space that, when stimulated, causes the neuron to respond. Receptive field size grows with depth into the visual system hierarchy, such that neurons in V4 have larger receptive fields than V1 cells [99], a phenomenon replicated in convolutional networks. The introduction of CNNs was a breakthrough in visual processing that enabled much more efficient learning of visual recognition tasks.
Unlike real neurons, convolutional layers in ANNs use shared weights, that is, all of the units in a convolutional layer integrate their inputs using the same set of weights. This dramatically reduces the number of parameters in the model, making learning much more efficient. Shared weights are clearly biologically implausible, but recent work demonstrates that CNNs trained without shared weights, known as locally-connected CNNs, can match the performance of traditional CNNs that use shared weights [100, 101].
2.5 Weight symmetry
Gradient descent requires the use of information about all downstream weights in the calculation of weight updates. This arises from the definition of the partial derivative (see equations 2.7 and 2.8). In neuroscience, this is known as the weight transport problem, because it is unclear how this information could be communicated given our current understanding of the brain. The weight transport problem can be solved if top-down feedback signals are communicated through feedback connections that have symmetric weights to the feedforward connections. However, the biological plausibility of symmetric feedforward and feedback connections, and how this symmetry can be achieved, remain unclear. Various algorithms have been proposed that deal with the issue of weight transport, some of which are outlined below.
2.5.1 Feedback alignment
Lillicrap et al. [46] identified that symmetric weights are not strictly necessary for gradient descent. Instead, a neural network that uses fixed, random feedback weights is able to solve machine learning tasks like image classification. In a network with fixed, random feedback weights, the feedback alignment (FA) effect causes feedforward weights to naturally become more aligned with the feedback weights over the course of learning, in service of minimizing the global cost function. While feedback alignment is able to match the performance of backprop on tasks like MNIST handwritten digit classification, further research has shown that it cannot successfully learn most difficult tasks requiring deeper convolutional networks, such as CIFAR-10 and ImageNet image classification [100, 102]. This performance gap is partially overcome if feedback weights are not necessarily symmetric, but match the sign of the feedforward weights [103, 104, 102]. It is now clear that, while strict weight symmetry is not required for gradient descent, the degree of weight symmetry correlates with learning performance, and strictly symmetric weights are the optimal regime [102, 100, 103, 105].
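A minimal sketch of feedback alignment on a linear two-layer network illustrates the key substitution: the backward pass uses a fixed random matrix B in place of the transpose of the forward weights. The target mapping, sizes, learning rate and number of steps are all illustrative assumptions:

```python
import numpy as np

# Sketch of feedback alignment on a linear two-layer network: the error
# is projected back through a fixed random matrix B instead of W2.T.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 2
W1 = 0.1 * rng.standard_normal((n_hid, n_in))
W2 = 0.1 * rng.standard_normal((n_out, n_hid))
B = rng.standard_normal((n_hid, n_out))       # fixed, random feedback weights
W_true = rng.standard_normal((n_out, n_in))   # mapping the network must learn

X_probe = rng.standard_normal((50, n_in))
def probe_loss():
    return np.mean((X_probe @ W1.T @ W2.T - X_probe @ W_true.T) ** 2)

loss_before = probe_loss()
lr = 0.005
for _ in range(4000):
    x = rng.standard_normal(n_in)
    h = W1 @ x
    err = W2 @ h - W_true @ x                 # output error
    W2 -= lr * np.outer(err, h)
    W1 -= lr * np.outer(B @ err, x)           # feedback through B, not W2.T
loss_after = probe_loss()
```

Even though the hidden-layer updates never use W2, the loss still decreases: over training, the forward weights come to align with the fixed feedback pathway, which is the FA effect described above.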
A model related to feedback alignment, called direct feedback alignment (DFA), uses a single set of error units that project back to all neurons in the network, rather than layer-by-layer sets of error units. This dramatically reduces the number of error units required, especially in deep networks. However, like FA, DFA does not match the performance of backprop on difficult learning tasks [100, 102]. Below, two mechanisms for achieving weight symmetry by updating feedback weights are outlined. These operate in network architectures with either reciprocal layer-by-layer feedback connections or layer-specific error units, as in FA. Future work could develop learning rules for feedback connections in DFA-like network architectures that enable weight symmetry.
2.5.2 Kolen-Pollack algorithm
Kolen and Pollack [106] identified that if feedforward and feedback weights are updated using the same weight update term as well as a weight decay term, then they are guaranteed to converge to one another. Suppose that weight updates are of the form:
\Delta W = A - \lambda W \quad (2.9)

\Delta Y = A - \lambda Y \quad (2.10)
where W is the matrix of feedforward weights between two layers in the network, Y is the matrix of reciprocal feedback weights, A is any matrix with the same shape, and 0 < λ < 1. These weight updates will cause W and Y to converge over time, leading to weight symmetry in a network. Networks trained using this algorithm are able to learn complex tasks where feedback alignment fails [42]. The work presented in Chapter 4 uses this form of weight updates in order to achieve symmetric weights.
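The convergence guarantee follows because the shared term A cancels in the difference: Δ(W − Y) = −λ(W − Y), so the gap between the two matrices contracts geometrically. This can be sketched with arbitrary (illustrative) update matrices:

```python
import numpy as np

# Sketch of the Kolen-Pollack updates (Eqs. 2.9-2.10). Because W and Y
# receive the same update A plus decay, their difference obeys
# Delta(W - Y) = -lambda * (W - Y) and shrinks by (1 - lambda) each step.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))     # feedforward weights
Y = rng.standard_normal((4, 3))     # feedback weights (different initialization)
lam = 0.1
gap_before = np.linalg.norm(W - Y)
for _ in range(100):
    A = rng.standard_normal((4, 3)) # any shared update term, even random
    W = W + A - lam * W
    Y = Y + A - lam * Y
gap_after = np.linalg.norm(W - Y)
```

Notably, convergence does not depend on what A is; in practice A is the learning signal, so the network learns and symmetrizes simultaneously.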
2.5.3 Weight mirroring
Akrout et al. [42] present a mechanism for learning weight symmetry called weight mirroring. This mechanism is presented in a network with a separate error signaling pathway, with one error unit for each unit in the feedforward layers (however, this algorithm should in theory be compatible with networks using multi-compartment units instead of separate error-carrying units, as in chapters 3 and 4). The network alternates between two modes. During the first mode, called the engaged mode, input propagates forward through the feedforward layers to compute their activities, while the error signal at the output layer propagates backward through the error pathway:
y^{l+1} = \sigma(W^{l+1} y^l) \quad (2.12)

\delta^l = \sigma(Y^{l+1} \delta^{l+1}) \quad (2.13)
where y^l is the output of feedforward layer l, W^{l+1} are the feedforward weights between layers l and l + 1, δ^l is the error signal at layer l, Y^{l+1} are the feedback weights between layers l + 1 and l, and σ is a nonlinear activation function. In the mirror mode, the output of a single layer l of the network is given by noise, and the error units at layers l and l + 1 are clamped such that δ^l = y^l and δ^{l+1} = y^{l+1}. Then, the feedback weights Y^{l+1} are updated using a Hebbian learning rule:
\Delta Y^{l+1} = \eta \, \delta^l (\delta^{l+1})^T \quad (2.14)
where η is a learning rate. The authors demonstrate that this learning rule allows a network to learn tasks where feedback alignment fails.
2.6 Related models of biologically plausible gradient descent
In recent years, significant attention has been devoted to developing algorithms for biologically plausible gradient descent, in part due to the severe limitations of Hebbian plasticity rules and the immense success of gradient descent-based models in the machine learning field. Below, a selection of these algorithms, some of which directly informed the work presented in this thesis, is briefly outlined.
2.6.1 Contrastive Hebbian learning
Contrastive Hebbian learning (CHL) is a biologically-inspired alternative to backprop that is based on Hebb’s rule. Unlike backprop, the activities of units exhibit temporal dynamics. In CHL, each feedforward connection is matched with a reciprocal feedback connection between units, and units integrate both the feedforward and feedback signals they receive. In standard CHL, feedback weights are assumed to be symmetric to feedforward weights, although recent work has shown that CHL can work with fixed, random feedback weights on some tasks [107]. There are two phases of learning: a free phase, where the input to the network is clamped and activity propagates through the network until it reaches a fixed point, and a clamped phase, where the input remains clamped and the output units are also clamped to the target, until the network reaches another fixed point. Weight updates are based on the difference in the products of pre- and post-synaptic steady-state activities of units in the two phases:
\Delta W^l \propto \hat{y}^l \otimes \hat{y}^{l-1} - \check{y}^l \otimes \check{y}^{l-1} \quad (2.15)
where ŷ^l and y̌^l are the steady-state outputs at layer l in the free phase and clamped phase, respectively. CHL performs gradient descent [108], but its assumption of weight symmetry is biologically problematic; random feedback weights have been shown to work with the model [107], and weight update rules on feedback weights that achieve symmetry are in theory compatible with it. However, to date, CHL has not been shown to scale to deep networks.
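Given steady-state activities from the two phases, the CHL weight update of Eq. 2.15 is a simple difference of outer products. The sketch below uses illustrative activity vectors and follows the sign convention of the equation as written in the text:

```python
import numpy as np

# Sketch of the contrastive Hebbian weight update (Eq. 2.15): a difference
# of outer products of steady-state activities from the two phases.
def chl_update(post_free, pre_free, post_clamped, pre_clamped, lr=0.1):
    return lr * (np.outer(post_free, pre_free)
                 - np.outer(post_clamped, pre_clamped))

post_free = np.array([1.0, 0.0])      # layer l, free phase
pre_free = np.array([1.0])            # layer l-1, free phase
post_clamped = np.array([0.0, 1.0])   # layer l, clamped phase
pre_clamped = np.array([1.0])         # layer l-1, clamped phase
dW = chl_update(post_free, pre_free, post_clamped, pre_clamped)
```

Each term is a purely local Hebbian product; the "contrastive" aspect is entirely in the subtraction of the two phases' correlations.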
2.6.2 Equilibrium propagation
Scellier and Bengio [41] present an energy-based learning rule for neural networks that approximates backpropagation of error. In addition, this learning rule has been shown to be compatible with non-symmetric weights [109]. As in CHL, for each training example, the network first performs a free phase, where the input is clamped and the state of the network (i.e. the activity of each unit in the network) converges to a fixed point called the free fixed point. This is followed by a weakly clamped phase, where the input is clamped and the output units are weakly clamped to move slightly closer to the target output, and the network converges to a weakly clamped fixed point. Weights in the network are then updated according to a function of the pre- and post-synaptic activities during the free and weakly clamped phases. Importantly, they prove that, under certain conditions, this weight update rule approximates gradient descent.
2.6.3 Difference target propagation
Lee et al. [110] present a method of approximating gradient descent that constructs local loss functions at each layer in order to update both feedforward and feedback weights in a way that decreases the global loss. They define a ‘target’ at layer l as:
\hat{y}^l = y^l + \phi(Y^{l+1} \hat{y}^{l+1}) - \phi(Y^{l+1} y^{l+1}) \quad (2.16)
where y^l is the output of layer l, ŷ^l is the target for layer l, Y^{l+1} is the set of feedback weights between layers l + 1 and l, and φ is a nonlinear function in the feedback path. Feedforward weights W^l and feedback weights Y^l are updated to minimize the local loss functions L^l and L^l_{inv}, respectively:
L^l = \|\hat{y}^l - y^l\|_2^2 \quad (2.17)

L^l_{inv} = \|\phi(Y^l \sigma(W^l (y^{l-1} + \epsilon))) - (y^{l-1} + \epsilon)\|_2^2 \quad (2.18)
where W^l are the feedforward weights of layer l, σ is the nonlinear activation function in the feedforward path, and ε is a noise term. They demonstrate that this model can match the performance of backprop on the MNIST classification task.
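The target computation of Eq. 2.16 can be sketched for a single layer; tanh is an assumed choice for the feedback nonlinearity φ, and all sizes and values are illustrative:

```python
import numpy as np

# Sketch of the difference target propagation target (Eq. 2.16), assuming
# tanh as the feedback nonlinearity phi.
def dtp_target(y_l, y_next, target_next, Y_fb, phi=np.tanh):
    return y_l + phi(Y_fb @ target_next) - phi(Y_fb @ y_next)

rng = np.random.default_rng(0)
y_l = rng.standard_normal(3)          # output of layer l
y_next = rng.standard_normal(2)       # output of layer l+1
Y_fb = rng.standard_normal((3, 2))    # feedback weights Y^{l+1}

# If the layer above has already reached its target, the difference
# correction vanishes and layer l's target equals its current output.
t = dtp_target(y_l, y_next, y_next.copy(), Y_fb)
```

The subtracted term is what distinguishes *difference* target propagation from plain target propagation: it cancels the systematic error introduced by an imperfect feedback inverse.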
2.6.4 Dendritic prediction learning
Urbanczik and Senn [111] present a two-compartment dynamic model of a pyramidal neuron whose dendritic compartment can learn to reproduce a target signal injected into the somatic compartment. This is accomplished using a three-factor plasticity rule incorporating the pre- and post-synaptic activity as well as the dendritic membrane potential. The importance of dendritic potential as a third factor in plasticity has been suggested by several experimental studies [112, 113]. Inputs at the dendritic compartment with weights w produce voltage V_w(t), while the voltage at the soma, U, is driven by both the dendritic voltage and external nudging conductances. The weight of a dendritic synapse i is updated using the plasticity induction variable PI_i(t), given by: