Biologically Realistic Artificial Neural Networks
Matthijs Kok
University of Twente
P.O. Box 217, 7500AE Enschede, The Netherlands
[email protected]

ABSTRACT
Artificial Neural Networks (ANNs) play an important role in machine learning nowadays. A typical ANN is traditionally trained using stochastic gradient descent (SGD) with backpropagation (BP). However, it is unlikely that a real biological neural network is trained similarly. Neuroscience theories, such as Hebbian theory, could inspire an adaptation of the traditional training method SGD to make ANNs more biologically plausible. Two mathematical descriptions of Hebbian theory will be suggested based on the limitations of the mathematical framework of Hebbian theory: a competitive Hebbian learning rule and an imply Hebbian learning rule. Finally, this research will propose a method, called the Hebbian Plasticity Term (HPT) method, that incorporates the mathematical description of Hebbian theory to modify the traditional training method SGD. Two variants of the HPT method are proposed: HPT-Competitive and HPT-Imply. The influence of the Hebbian plasticity term ϕ on SGD shows more biologically realistic plasticity of a synaptic connection between neurons in the ANN, at the cost of performance. The results of this research can be used as a basis for developing more biologically realistic artificial neural networks.

Keywords
Artificial Neural Network, Stochastic Gradient Descent with Backpropagation, Hebbian Theory, Hebbian Plasticity Term method

1. INTRODUCTION
Deep learning is a very important area in the study of machine learning nowadays, with wide applications [21, 18, 10] such as speech recognition [9, 15], natural language processing [7], and image recognition [11, 32]. It describes a powerful set of techniques for learning in artificial neural networks (ANNs) [4, 25, 31]. A traditional method to train ANNs is stochastic gradient descent (SGD), which makes use of backpropagation (BP) [13, 25].

In the biological sciences, understanding the biological basis of learning is considered to be one of the ultimate challenges [17]. Theoretical or computational neuroscientists strive to make a link between observed biological processes and biologically plausible theoretical mechanisms (computational models) for neural systems, to gain an understanding of how biological systems, such as the brain, work. Furthermore, neuroscience also aims to make contributions towards progress in artificial intelligence, such as inspiration and validation for new computer algorithms and architectures [5]. To improve the understanding in both fields, models that better resemble biological neural networks are of utmost importance.

ANNs are vaguely inspired by biological neural networks. However, the method SGD-BP that is used to train ANNs does not resemble the way in which the human brain learns. This research aims to contribute to closing the gap between artificial and biological neural networks: it investigates a neuroscience theory called Hebbian theory [16] as a starting point to modify the traditional training method SGD of ANNs, aiming to make the learning process more biologically plausible. Within this research, two mathematical formulations are proposed for different Hebbian learning rules, which are incorporated into traditional SGD by a newly developed method.

This paper will investigate the following research questions:

RQ 1 How could Hebbian theory be used to mathematically describe a biologically realistic evolution of a weight?
RQ 1.1 What are the limitations to Hebbian theory when using it to formulate a mathematical description of a weight update?
RQ 1.2 Which types of mathematical descriptions of Hebbian theory could be used, as a result of these limitations?
RQ 2 Could the training algorithm BP-SGD of a typical standard ANN be modified using Hebbian theory to become a more biologically plausible training method?
RQ 2.1 How could a mathematical description of Hebbian theory be incorporated into the standard training algorithm BP-SGD?
RQ 2.2 Does a connection weight between two neurons of an artificial neural network evolve more biologically plausibly as a result of incorporating the new method?
RQ 3 To what extent does the performance of an ANN trained using the newly developed, more biologically realistic training method differ from the performance of an ANN trained using the traditional training method BP-SGD?

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
33rd Twente Student Conference on IT, July 3rd, 2020, Enschede, The Netherlands.
Copyright 2020, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

2. ANALOGY BETWEEN THE ANN AND THE BRAIN

2.1 Neurons in the brain
Figure 1. Schematic image of two neurons in a brain

Figure 1 depicts two neurons in a brain that are separated by a synapse. When the leftmost neuron, the pre-synaptic neuron, is excited by an action potential, the vesicles in its axon terminals fuse with its membrane and release their contents, neurotransmitters, into the synaptic cleft. Once released, the neurotransmitters can bind to the receptors on the dendrite of the rightmost neuron, the post-synaptic neuron, acting as a chemical signal. That binding opens ion channels on its dendrite that allow charged ions to flow in and out of the cell, converting the chemical signal into an electrical signal. If the combined effect of multiple dendrites (which can be connected to other neurons) of the post-synaptic neuron changes the overall charge of the cell enough, then it triggers an action potential, also called a spike [24, 27].

2.2 The relation between an ANN and neurons in the brain
A typical feedforward ANN could be described as a structured collection of nodes with edges between them, as depicted in Figure 2.

Figure 2. Example feedforward ANN with 2 neurons in the input layer, 3 neurons in the hidden layer, and 2 neurons in the output layer

The ANN that is depicted in Figure 2 represents connected neurons in a brain, as connected nodes. The spike in a neuron is considered a binary phenomenon: it is there or it is not. Therefore, the activation in an ANN represents the firing rate of the neuron. The more often a pre-synaptic neuron fires, the higher the chance that the post-synaptic neuron will also increase its firing rate. The activation a_j^l (of a node) in an ANN of a post-synaptic neuron is dependent on the activations a_i^{l-1} of the pre-synaptic neurons, according to equation (1):

    a_j^l = σ( Σ_i (a_i^{l-1} · w_ji^l) + b_j^l )    (1)

Here, σ denotes a sigmoid activation function; the bias is unique to a neuron and is denoted as b_j^l. The weighted input sum Σ_i (a_i^{l-1} · w_ji^l) + b_j^l will be abbreviated as z_j^l throughout the rest of this paper. Furthermore, the superscript is written as l to denote any layer, as an integer to denote a specific layer, or as L, L-1, L-2, ..., where L denotes the last layer. Both i and j denote the position of a neuron within a layer. The use of subscript letters will be alphabetically coherent with the order of layers (e.g. ..., a_h^{l-2}, a_i^{l-1}, a_j^l, a_k^{l+1}, ...). The dependence of the post-synaptic neuron a_j^l on multiple pre-synaptic neurons a_i^{l-1} is also present in real biological post-synaptic neurons, which can be connected to multiple (pre-synaptic) neurons through their (multiple) dendrites, which together determine whether or not an action potential is triggered.

An ANN shows connections as edges between nodes. The edge or connection is analogous to the synapse between two neurons in a brain. In an ANN, a weight w_ji^l is associated to a connection and is considered to represent the strength of the respective connection. In other words, a large weight in an ANN implies a large synaptic efficacy of the analogous biological synapse. The subscript ji denotes the position of the synapse within a layer. Here, j denotes the post- and i the pre-synaptic neuron, which together form the synaptic connection w_ji^l.

3. TRAINING AN ANN
A common ANN training method is stochastic gradient descent (SGD), which requires backpropagation (BP). This section discusses how traditional SGD is performed on an ANN, to serve as a basis for sections 6 and 7.

In an ANN, one feedforward is performed by computing equation (1) sequentially for each layer. The weights w_ji^l and biases b_j^l used to compute equation (1) are initialized as random variables drawn from a standard normal distribution. Since the computation of each separate component is cumbersome, vector-wise multiplication is used. Below, the matrix multiplication for a feedforward is shown (in this example layer l has 2 units and layer l+1 has 4 units):

    [z_1^{l+1}]   [w_11^{l+1}  w_12^{l+1}]              [b_1^{l+1}]
    [z_2^{l+1}] = [w_21^{l+1}  w_22^{l+1}] · [a_1^l] +  [b_2^{l+1}]    (2)
    [z_3^{l+1}]   [w_31^{l+1}  w_32^{l+1}]   [a_2^l]    [b_3^{l+1}]
    [z_4^{l+1}]   [w_41^{l+1}  w_42^{l+1}]              [b_4^{l+1}]

where

    [a_1^{l+1}]      [z_1^{l+1}]
    [a_2^{l+1}] = σ( [z_2^{l+1}] )    (3)
    [a_3^{l+1}]      [z_3^{l+1}]
    [a_4^{l+1}]      [z_4^{l+1}]

In short, this can be notated as:

    z^{l+1} = W^{l+1} · a^l + b^{l+1}    (4)

and

    a^{l+1} = σ(z^{l+1})    (5)

Here, W^l is a matrix and a^l, b^l, z^l are vectors.

The prediction or classification of a digit by the network is determined to be the highest activation a_j^L in the output layer. To be able to improve the performance, the cost function in equation (6) is calculated.
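As a concrete illustration, equations (4) and (5) can be sketched with NumPy. The sizes mirror the 2-unit to 4-unit example above; all variable names are hypothetical and the values are random:

```python
import numpy as np

def sigmoid(z):
    # σ from equation (1), applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Layer l has 2 units, layer l+1 has 4 units, as in equations (2)-(3)
W = rng.standard_normal((4, 2))   # weight matrix W^{l+1}
b = rng.standard_normal((4, 1))   # bias vector b^{l+1}
a = rng.random((2, 1))            # activations a^l of layer l

z = W @ a + b        # equation (4)
a_next = sigmoid(z)  # equation (5)

print(a_next.shape)  # (4, 1): one activation per unit in layer l+1
```

Repeating this computation layer by layer yields one full feedforward of the network.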

    C = (1/2) · Σ_j (a_j^L − y_j)²    (6)

Here, y_j is a component of the vector of desired output activations y, also called the label: y_j = 1 when the digit corresponding to output neuron j is displayed on the image, and y_j = 0 otherwise. The value of the loss function (6) is large for a big difference and small for a small difference between the output activation a^L and the label y.

To be able to train the network, the weights and biases are modified to minimize the cost function, which is equivalent to maximizing the performance. Since the cost function is complex and multivariate, it is very difficult to calculate the global minimum. Therefore, a (possibly local) minimum is approached in steps by adjusting all weights and biases in the direction of the negative gradient of the cost function. This method is called gradient descent [8, 25]. The gradient ∇C of a function is the direction of steepest ascent at a point P. A point P means a value for each of the variables w_ji^l and b_j^l. ∇C is a (column) vector whose components are the partial derivatives of the cost function with respect to all weights and biases in the network, shown in equation (7):

    ∇C(w, b) = [ ∂C/∂w_11^1  ∂C/∂b_1^1  ···  ∂C/∂w_ji^L  ∂C/∂b_j^L ]^T    (7)

At each step towards a minimum of the cost function, the value of every weight and bias in the network is adjusted according to equations (8) and (9):

    w_ji^l ← w_ji^l − η · ∂C/∂w_ji^l    (8)

    b_j^l ← b_j^l − η · ∂C/∂b_j^l    (9)

Here, the step size or weight update Δw_ji^l = η · ∂C/∂w_ji^l is determined by the partial derivative of the cost function with respect to that particular weight, ∂C/∂w_ji^l, and the learning rate η. The partial derivative ∂C/∂w_ji^l is the corresponding component of the gradient of the cost function ∇C. The learning rate η can be chosen to control the magnitude of the step size.

To be able to compute ∇C, every component ∂C/∂w_ji^l and ∂C/∂b_j^l of ∇C must be computed. A component can be computed using the chain rule [35], according to equations (10) and (11):

    ∂C/∂w_ji^l = ∂z_j^l/∂w_ji^l · ∂a_j^l/∂z_j^l · ∂C/∂a_j^l = a_i^{l−1} · σ′(z_j^l) · ∂C/∂a_j^l    (10)

and the partial derivative with respect to a bias in layer l of the network becomes:

    ∂C/∂b_j^l = ∂z_j^l/∂b_j^l · ∂a_j^l/∂z_j^l · ∂C/∂a_j^l = 1 · σ′(z_j^l) · ∂C/∂a_j^l    (11)

where both in equation (10) and (11) it holds that:

    ∂C/∂a_j^l = Σ_k ∂z_k^{l+1}/∂a_j^l · ∂a_k^{l+1}/∂z_k^{l+1} · ∂C/∂a_k^{l+1} = Σ_k w_kj^{l+1} · σ′(z_k^{l+1}) · ∂C/∂a_k^{l+1}    (12)

Also, note that ∂z_j^l/∂w_ji^l = a_i^{l−1}, ∂z_j^l/∂b_j^l = 1, ∂a_j^l/∂z_j^l = σ′(z_j^l), and ∂C/∂a_j^L of the last layer L equals a_j^L − y_j. To see how the chain rule expressions (10) and (11) are determined, the dependence diagram [1] of Figure 3 is useful.

Figure 3. Dependence in the ANN

While propagating backwards through the network, the terms of partial derivatives get longer in a consistent way. Generally, downstream (closer to the output layer) weight updates have shorter terms, whereas upstream (closer to the input layer) weight updates have longer terms. When backpropagating, the partial derivatives of the cost function with respect to each weight and bias are calculated for each layer as a vector, based on the partial derivatives of more downstream vectors, as follows:

    ∂C/∂b_i^{L−1} = 1 · σ′(z_i^{L−1}) · Σ_j (w_ji^L · ∂C/∂b_j^L)    (13)

In (per-layer) vectors, this can be translated into equation (14):

    ∂C/∂b^{l−1} = σ′(z^{l−1}) ⊙ (W^l)^T · ∂C/∂b^l    (14)

Here, the Hadamard product, or element-wise product, is denoted as ⊙. The partial derivative with respect to the weight can then be computed based upon the following relation:

    ∂C/∂W^l = ∂C/∂b^l · (a^{l−1})^T    (15)

3.1 Mini-Batch SGD
A good middle ground between the stochastic gradient descent algorithm (which picks one training example and updates immediately after it) and the standard gradient descent algorithm (which updates only once per epoch, after calculating the average gradient over all training examples) is a form of stochastic gradient descent called 'mini-batch' stochastic gradient descent [28, 25]. In mini-batch stochastic gradient descent, you pick a random set or 'batch' of training examples. Subsequently, you compute the average gradient over that batch of training examples and update using that gradient. This combines fast convergence to a minimum with a smooth line if you plot the cost function over time, because you update more often (after calculating the average gradient of the batch) while still averaging over a few training examples (those of the batch). It is also computationally lighter, because you only update all weights and biases after computing the average gradient over the training examples in the batch.

4. NEUROSCIENCE THEORIES

4.1 Hebbian Theory
The Hebbian theory, or Hebb's rule, is a neuroscientific theory that attempts to explain synaptic plasticity [16]. The theory claims that if the pre-synaptic neuron repeatedly fires and stimulates a post-synaptic neuron, some growth process or metabolic change takes place in one or both cells that increases the efficiency of the pre-synaptic neuron in firing the post-synaptic neuron.
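To make the update equations of section 3 concrete, the sketch below computes the output-layer gradients of equations (10), (11) and (15) for a tiny hypothetical layer and checks one component of ∇C against a finite-difference estimate; all sizes, names, and values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Tiny hypothetical output layer: 3 inputs -> 2 outputs
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
x = rng.random((3, 1))          # activations a^{L-1}
y = np.array([[1.0], [0.0]])    # label vector

z = W @ x + b
a = sigmoid(z)
cost = 0.5 * np.sum((a - y) ** 2)       # equation (6)

# Equation (11) with dC/da^L = a^L - y, then equation (15):
dC_db = a * (1 - a) * (a - y)           # sigma'(z^L) * (a^L - y)
dC_dW = dC_db @ x.T

# Check one component against a finite-difference estimate
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
cost_pert = 0.5 * np.sum((sigmoid(W_pert @ x + b) - y) ** 2)
numeric = (cost_pert - cost) / eps

print(bool(abs(numeric - dC_dW[0, 0]) < 1e-4))  # True: analytic matches numeric
```

The same check extends layer by layer via equation (14), which is how the implementation in section 8 propagates the error backwards.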

The theory is often summarized as "Cells that fire together, wire together" [22]. However, Hebb implicitly describes that the pre-synaptic cell should fire before, and not at the same time as, the post-synaptic cell.

This firing causation mentioned by Hebb is described in what is currently known as spike-timing-dependent plasticity (STDP), where temporal precedence is required [6]. This is a biological process that adjusts the strength of connections between neurons in the brain based on the relative timing of a neuron's output and input action potentials. As a result, pre-synaptic neurons that cause the post-synaptic neuron's excitation become more likely to contribute in the future, whereas inputs that are not the cause of the post-synaptic spikes are made less likely to contribute in the future. Eventually, a subset of the connections between the pre- and post-synaptic neurons remains.

According to equation (10), the "fire together, wire together" postulate is already partly implemented in a traditional ANN. This can be summarized in two points [25]. First, the weight w_ji^l of a connection between two neurons in an ANN will learn fast if the input neuron is high-activation (a_i^{l−1} ≈ 1) and slowly if it is low-activation (a_i^{l−1} ≈ 0), because the magnitude of a weight update ∂C/∂w_ji^l depends on the magnitude of the activation of the input neuron a_i^{l−1}. Second, if the output neuron has already "learned", increasing the weights to that output neuron is not very useful anymore, because the neuron has already reached its target value (e.g. a_j^l ≈ 1 or a_j^l ≈ 0). It appears the weight of a connection between two neurons will learn slowly if the output neuron has saturated (i.e. is either high- or low-activation), because then σ′(z_j^l) ≈ 0 and the weight update ∂C/∂w_ji^l depends on the size of σ′(z_j^l).

4.2 BCM Theory
The BCM theory, or BCM synaptic modification, is a physical theory of learning in the visual cortex [3]. Part of the development of the BCM theory is based on limits of the Hebbian theory. Hence, Hebbian theory was normalized by introducing a loss in connection strength when there is no or unsynchronized activity, allowing for a decay of synapses.

The BCM model states that synaptic plasticity is stabilized by a dynamic adaptation of the time-averaged post-synaptic activity. According to the BCM model, a post-synaptic neuron will tend to undergo long-term potentiation (LTP) if it is in a high activity state, and long-term depression (LTD) otherwise; LTP and LTD are a lasting increase or decrease, respectively, in signal transmission between two neurons.

5. LIMITATIONS TO THE BIOLOGICAL PLAUSIBILITY OF AN ANN
Nowadays, there are still elements in an ANN that are not biologically plausible, such as the weight decay and temporal precedence described in BCM theory. There are implementations of algorithms that propose to take this into account, such as [34], which includes STDP in ANNs.

Another limitation of ANNs is the biologically implausible backpropagation algorithm [20]. Whittington et al. [36] explore this in a review article with a main focus on the lack of local error representation. What is pointed out in [36] is that it appears unclear how the synaptic plasticity afforded by the backpropagation algorithm could be achieved in the brain, since biological synapses change their connection strength based solely on local signals, that is, the activity of the neurons they connect. According to equations (10) and (12), this is currently not the case, since a weight update depends on multiple non-local and more downstream terms. There are several studies that try to overcome this [23, 26]. In [36], alternatives are suggested such as temporal-error models like contrastive learning [30] and an energy-based continuous update model [2], or explicit error models like a predictive coding model [37] and a dendritic error model [29, 14]. However, all of these models introduce new biologically implausible ideas, such as control signals, phases, unrealistic architectures, and extra connections.

Additionally, a biologically problematic aspect described by [36] is the unrealistic model of a neuron in an ANN with regard to its continuous output. Real neurons use spikes. Sporea et al. [33] introduce a supervised learning algorithm for multilayer spiking neural networks (SNNs), which do not fire at each propagation cycle (as in a typical ANN), but rather fire only when a membrane potential reaches a specific value (the threshold). Hence, SNNs use a binary output instead of the continuous output of ANNs. Unfavorably, the current activation level of a neuron is modeled as a differential equation in an SNN, which eliminates training methods like BP-SGD. Furthermore, SNNs have much larger computational costs than ANNs.

6. QUANTIFYING HEBBIAN THEORY
In order to measure the influence of a Hebbian learning rule, the theory must be quantified mathematically. Hebb's rule says that the synaptic efficiency depends on the firing of the pre-synaptic neuron and the post-synaptic neuron. In other words, the weight w_ji^l of a connection depends on the activation of the pre-synaptic neuron a_i^{l−1} and the activation of the post-synaptic neuron a_j^l. Since Hebbian theory only describes conditions for connection strengthening (w_ji^l increase), conditions should be added that allow for weight decay.

To devise a formula for a weight update rule that adheres to Hebbian theory, it is easiest to think of activations in a binary way. A possible set of conditions covering all 2² possible scenarios (when activations are binary) is the following:

1. a_i^{l−1} = 0 ∧ a_j^l = 0  ⟹  w_ji^l(t+1) = w_ji^l(t)
2. a_i^{l−1} = 0 ∧ a_j^l = 1  ⟹  w_ji^l(t+1) < w_ji^l(t)
3. a_i^{l−1} = 1 ∧ a_j^l = 0  ⟹  w_ji^l(t+1) < w_ji^l(t)
4. a_i^{l−1} = 1 ∧ a_j^l = 1  ⟹  w_ji^l(t+1) > w_ji^l(t)

In this paper, equation (16) is proposed, which adheres to these conditions:

    w_ji^l(t+1) = (a_i^{l−1} · a_j^l − |a_i^{l−1} − a_j^l|) · c + w_ji^l(t)    (16)

where c is a real constant that can take on any (positive) value, and time t is indicated between brackets. It is easily observed that equation (16) is not limited to binary activations, but also works for real-valued activations. Although this is a way to quantify Hebbian learning for a weight, it is not the only way: many equations could adhere to the conditions stated above. Furthermore, the conditions themselves are rather arbitrary. Only condition 4 is certain when talking about Hebbian learning, because Hebbian theory does not mention anything about weight decay.
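A few lines of Python suffice to check that equation (16) indeed satisfies the four conditions above; the constant c = 0.1 and the starting weight 0.5 are arbitrary illustrative values:

```python
# Equation (16): w(t+1) = (a_pre * a_post - |a_pre - a_post|) * c + w(t)
def hebb_competitive(a_pre, a_post, w, c=0.1):
    return (a_pre * a_post - abs(a_pre - a_post)) * c + w

w = 0.5
print(hebb_competitive(0, 0, w))  # 0.5 : unchanged (condition 1)
print(hebb_competitive(0, 1, w))  # 0.4 : decay     (condition 2)
print(hebb_competitive(1, 0, w))  # 0.4 : decay     (condition 3)
print(hebb_competitive(1, 1, w))  # 0.6 : growth    (condition 4)
```

Because the rule is a plain algebraic expression, the same function also accepts real-valued activations between 0 and 1.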

Hence, the resulting relation between w_ji^l(t+1) and w_ji^l(t) could be diversified for conditions 1-3. For example, another configuration of the conditions could be one where conditions 1, 3 and 4 remain the same and condition 2 is changed: a_i^{l−1} = 0 ∧ a_j^l = 1 ⟹ w_ji^l(t+1) = w_ji^l(t). This is inspired by the imply operator in logical truth tables. In Table 1, the truth table for an imply operator is shown. Imagine p = a_i^{l−1} and q = a_j^l, where (p ⟹ q) = 0 means a weight decay and any other case means no weight decay.

Table 1. Truth table of logical implication
p  q  p ⟹ q
0  0  1
0  1  1
1  0  0
1  1  1

In this paper, equation (17) is proposed, which adheres to this configuration of conditions:

    w_ji^l(t+1) = (2 · a_i^{l−1} · a_j^l − a_i^{l−1}) · c + w_ji^l(t)    (17)

For the sake of clarity, the configurations for the conditions of equations (16) and (17) can be translated to Table 2.

Table 2. Conditions for equations (16) and (17)
pre a_i^{l−1}   post a_j^l   eq. (16): w_ji^l(t+1) ~ w_ji^l(t)   eq. (17): w_ji^l(t+1) ~ w_ji^l(t)
0               0            =                                   =
0               1            <                                   =
1               0            <                                   <
1               1            >                                   >

Fortunately, more of such configurations have been extensively explored in earlier work [12]. For example, another equation that adheres to the condition configuration of the last column of Table 2, proposed in [12], is w_ji^l(t+1) = (a_j^l − c) · a_i^{l−1} + w_ji^l(t) for 0. Such a rule makes the synapses converging on a neuron compete with each other, which is also called "spatial competition between convergent afferents". BCM theory [3] tries to overcome this by proposing a synaptic modification that results in temporal competition between input patterns rather than a spatial competition between different synapses. Whether the synaptic strength increases or decreases depends on the magnitude of the post-synaptic response as compared with a variable modification threshold:

    ẇ_ji^l = φ(a_j^l) · a_i^{l−1} − ε · w_ji^l    (18)

where

    φ(a_j^l) = sign(a_j^l − θ_w)    (19)

Here, φ(a_j^l) is a scalar function of the post-synaptic activity a_j^l that changes sign at a value θ_w of the output called 'the modification threshold':

    φ(a_j^l) < 0 for a_j^l < θ_w,
    φ(a_j^l) > 0 for a_j^l > θ_w    (20)

The term −ε · w_ji^l produces a uniform decay over all junctions. This does not affect the behaviour of the system if ε is small enough. In [3] it was proposed to use the average output activation ā_j^l = Σ_i w_ji^l · ā_i^{l−1}, which averages over a distribution of inputs a_i^{l−1}, to determine the threshold θ_w:

    θ_w = (ā_j^l / a_0)^p · ā_j^l    (21)

where a_0 and p are hyper-parameters.

Although the quantifications of Hebbian theory and BCM theory can now be used to mathematically describe the lifetime of a connection strength w_ji^l, they cannot replace SGD directly. Just implementing the Hebbian rules of equations (16) and (17) in an ANN will cause the weights to update accordingly, but the network will not learn. After training, the imply Hebbian learning rule gains an accuracy of 0.09872 with a corresponding loss of
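Equation (17) can be checked against the imply-style column of Table 2 with a short sketch; as before, the constant c = 0.1 and the starting weight 0.5 are arbitrary illustrative values:

```python
# Equation (17): w(t+1) = (2 * a_pre * a_post - a_pre) * c + w(t)
# Decay happens only in the (pre=1, post=0) case, mirroring (p => q) = 0.
def hebb_imply(a_pre, a_post, w, c=0.1):
    return (2 * a_pre * a_post - a_pre) * c + w

w = 0.5
for pre, post in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pre, post, hebb_imply(pre, post, w))
# (0,0) and (0,1) leave w unchanged; (1,0) decays it; (1,1) grows it
```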

7. THE HEBBIAN PLASTICITY TERM (HPT) METHOD
In the HPT method, a Hebbian plasticity term (HPT) ϕ_ji^l is added to each weight during the feedforward, so that the effective weight becomes w_new = w_ji^l + ϕ_ji^l and equation (1) becomes:

    z_j^l = Σ_i ((w_ji^l + ϕ_ji^l) · a_i^{l−1}) + b_j^l    (22)

A literal quote from Hebb's postulate is: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased" [16]. Important here is the word repeatedly. This must mean that the synaptic efficiency is based on a multitude of pre- and post-activations from the past: the synapse grows based on the history of the two cells that it connects. So, somehow, the weight of a connection w_ji^l should depend on some Hebbian function of the history of both the pre- and post-synaptic activation:

    ϕ_ji^l = f((a_i^{l−1})_avg, (a_j^l)_avg)    (23)

Here, (a_i^{l−1})_avg and (a_j^l)_avg are both a moving average over all previous training examples that were already fed forward.

To explain this idea, consider the current feedforward of a training example to be at time t; the feedforward of the previous training example would then be at time t−1. Using the post-synaptic activation a_j^l(t) to calculate the HPT ϕ_ji^l would not be possible at time t in the HPT method (22), because it is impossible for the HPT ϕ_ji^l to depend on the post-synaptic activation a_j^l: information about a_j^l is not yet available when calculating z_j^l. This is a circular argument. The weight cannot depend on the post-synaptic activation, as the post-synaptic activation is in turn dependent on that weight, resulting in an infinite circular dependence. However, the post-synaptic activation of a previously fed-forward training example can be used to compute the HPT ϕ_ji^l, since the feedforward of the training example at time t−1, t−2, ..., or t−n (where n is the total number of completed feedforwards) is already completed and therefore already known at time t.

From this, an average post-synaptic activation (a_j^l)_avg = (a_j^l(t−1) + a_j^l(t−2) + ... + a_j^l(t−n)) / n over the previous training examples can be computed. After every feedforward of a new training example, the average (a_j^l)_avg changes slightly, making it a moving average over all previously fed-forward training examples.

As explained in section 6, there are several Hebbian rules that could be chosen for f in equation (23), based on the configuration of conditions you choose. Therefore, two variants of the HPT method will be proposed. In the first variant, the HPT ϕ is based on a competitive Hebbian learning rule and adheres to the conditions of equation (16) displayed in Table 2. This version will be called HPT-Competitive:

    f = ((a_i^{l−1})_avg · (a_j^l)_avg − |(a_i^{l−1})_avg − (a_j^l)_avg|) · c    (24)

In the second variant, the HPT ϕ is based on an imply Hebbian learning rule and adheres to the conditions of equation (17) displayed in Table 2. This version will be called HPT-Imply:

    f = (2 · (a_i^{l−1})_avg · (a_j^l)_avg − (a_i^{l−1})_avg) · c    (25)

In both equations (24) and (25), c can take on any real nonnegative value. Increasing c means increasing the influence of the Hebbian plasticity term ϕ on w_new in equation (22); in other words, it means increasing the influence of the Hebbian plasticity term on the feedforward. An increase in c also means an increase of the influence of the HPT ϕ on the weight update ∂C/∂w_ji^l and bias update ∂C/∂b_j^l during backpropagation. According to equations (10), (11) and (12), both updates depend on ∂C/∂a_j^l, which in turn depends on ∂z_k^{l+1}/∂a_j^l, and for the HPT method:

    ∂z_k^{l+1}/∂a_j^l = w_kj^{l+1} + ϕ_kj^{l+1}

It is important that the HPT ϕ is located exactly where it is in the feedforward equation (22). Locating it elsewhere would result in undesired behaviour. Consider z_j^l = Σ_i w_ji^l · (a_i^{l−1} + ϕ) + b_j^l. If w_ji^l < 0 ∧ ϕ_ji^l < 0 (where the weight should become smaller according to the Hebbian function f), then the resulting z_j^l (which, according to equation (1), in turn affects the output activation a_j^l) becomes larger, since multiplying two negatives yields a positive number: (w_ji^l · ϕ_ji^l) > 0. However, the reason why ϕ obtained a negative value in the first place was to decrease the efficiency of the synapse, so that the output activation a_j^l would decrease. Also, when w_ji^l < 0 ∧ ϕ_ji^l > 0, then (w_ji^l · ϕ_ji^l) < 0 will make z_j^l smaller, which is again not desired, since z_j^l should become bigger when ϕ > 0.

8. IMPLEMENTATION
Algorithm 1 shows the pseudocode for the algorithm that was created as the implementation for backpropagating in the ANN. It was created using Python 3.6.2, with the libraries Numpy for vector calculations and Matplotlib.pyplot for displaying results graphically. Equation (14) is incorporated here on line 6. This code is used for the HPT method, where modifications to standard SGD are marked in cyan.

Algorithm 1: Backpropagation with HPT method
   // Initialize ∂C/∂b^L and ∂C/∂w^L for the last layer
   1  ∂C/∂b^L ← σ′(z^L) ⊙ (a^L − y)
   2  ∂C/∂w^L ← ∂C/∂b^L · (a^{L−1})^T
   // Add both to the lists
   3  gradient_bias ← [∂C/∂b^L]
   4  gradient_weights ← [∂C/∂w^L]
   // Backpropagate through all layers, starting from the one-but-last
   // layer, since the last layer was already added at initialization
   5  for k ← 1 to N do
   6      ∂C/∂b^L ← σ′(z^{L−k}) ⊙ (W^{L−k+1} + ϕ^{L−k+1})^T · ∂C/∂b^L
   7      ∂C/∂w^L ← ∂C/∂b^L · (a^{L−k−1})^T
   8      gradient_bias.append(∂C/∂b^L)
   9      gradient_weights.append(∂C/∂w^L)
   10 end
   // Reverse the lists so they contain vectors from first to last
   // layer instead of from last to first layer
   11 gradient_bias.reverse()
   12 gradient_weights.reverse()

Important things to note are the following. All terms are considered matrices. A vector is a column-vector (a matrix with one column and multiple rows) unless referred to otherwise, like a row-vector.
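A minimal NumPy rendering of Algorithm 1's backward pass could look as follows; the [4, 3, 2] layer sizes are hypothetical, and the plasticity terms ϕ are zero-initialized here for simplicity (so this sketch reduces to standard BP gradients):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(2)
sizes = [4, 3, 2]                                   # hypothetical layer sizes
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal((m, 1)) for m in sizes[1:]]
phis = [np.zeros_like(W) for W in Ws]               # HPT terms, equation (22)

# Feedforward with the HPT: z^l = (W^l + phi^l) a^{l-1} + b^l
a = rng.random((sizes[0], 1))
acts, zs = [a], []
for W, b, phi in zip(Ws, bs, phis):
    z = (W + phi) @ a + b
    a = sigmoid(z)
    acts.append(a)
    zs.append(z)

# Backward pass of Algorithm 1
y = np.array([[1.0], [0.0]])
delta = sigmoid_prime(zs[-1]) * (acts[-1] - y)      # line 1
grad_b = [delta]
grad_W = [delta @ acts[-2].T]                       # line 2
for k in range(2, len(sizes)):                      # lines 5-9
    delta = sigmoid_prime(zs[-k]) * ((Ws[-k + 1] + phis[-k + 1]).T @ delta)
    grad_b.append(delta)
    grad_W.append(delta @ acts[-k - 1].T)
grad_b.reverse()                                    # line 11
grad_W.reverse()                                    # line 12

print([g.shape for g in grad_W])  # [(3, 4), (2, 3)]: one gradient per layer
```

In the actual HPT method the phi matrices would be filled via equation (24) or (25) before each feedforward, rather than left at zero.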

ϕ^l in Algorithm 1 is calculated according to equation (24) or (25) for HPT-Competitive and HPT-Imply, respectively. Both variants of the HPT method need the moving averages of activations (a_i^{l−1})_avg and (a_j^l)_avg for every layer in the network. This list of per-layer moving-average activation vectors is initialized as a list avg_acts. After every mini-batch, the average activation of the mini-batch for every (per-layer) vector, located in the list batch_avg_acts, is added to every (per-layer) vector located in avg_acts. This is displayed in Algorithm 2.

Algorithm 2: avg_acts calculation
   // Update moving average avg_acts after a batch
   1 for act_vec ∈ batch_avg_acts do
   2     avg_acts[act_vec] += batch_avg_acts[act_vec]
   3 end
   4 avg_acts = [history / 2 for history in avg_acts]

Interesting to note is that the moving average of all previous activations counts as much as the newly added batch-average of activations, since the sum is divided by 2.

Figure 4. HPT-Competitive method for different values of c: (a) loss function, (b) accuracy
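Algorithm 2's update can be sketched directly in Python; the layer sizes and batch averages below are made-up values, chosen so that the halving behaviour is visible:

```python
import numpy as np

# Equal weighting of history and newest batch, as in Algorithm 2:
# new_avg = (history + batch_average) / 2
def update_avg_acts(avg_acts, batch_avg_acts):
    return [(hist + batch) / 2 for hist, batch in zip(avg_acts, batch_avg_acts)]

# Hypothetical per-layer moving averages for a 3-unit and a 2-unit layer
avg_acts = [np.zeros((3, 1)), np.zeros((2, 1))]

batch1 = [np.full((3, 1), 1.0), np.full((2, 1), 0.5)]
batch2 = [np.full((3, 1), 0.0), np.full((2, 1), 0.5)]

avg_acts = update_avg_acts(avg_acts, batch1)   # -> 0.5 and 0.25
avg_acts = update_avg_acts(avg_acts, batch2)   # -> 0.25 and 0.375

print(avg_acts[0][0, 0], avg_acts[1][0, 0])    # 0.25 0.375
```

Because every update halves the old history, each mini-batch's contribution decays geometrically, which is what gives the most recent activations the most influence.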

(a) Loss function (b) Accuracy that the moving average of all previous activations counts as much as the newly added batch-average of activations, since it is divided by 2. This way, the most recent activa- Figure 5. HPT-Imply method for different values of c tions in the network have the most influence on (a)avg and thus on ϕ. The source code for the complete project can be found on: https://github.com/matthijsruben/NeuralNets Table 3. Losses and Accuracy’s of the HPT method for different values of c 9. RESULTS HPT-Competitive HPT-Imply c Loss Accuracy Loss Accuracy 9.1 HPT method 0 0.01053 0.93537 0.01034 0.93675 In this section, the performances of novel methods pro- 0.1 0.01276 0.92770 0.01495 0.91547 posed in this paper will be shown and compared with the 0.5 0.04201 0.85137 0.49220 0.06995 performance of a standard ANN. All networks are trained 1 0.09267 0.48920 0.43276 0.07898 on the MNIST dataset [19], because of its large amount 5 0.10206 0.08398 0.37599 0.09192 of training examples. This dataset contains 28x28 pixel 10 0.33938 0.09112 0.38906 0.09083 (greyscale valued: 0-255) images of handwritten digits. The benchmark ANN from which the different methods will be evaluated is a four-layer deep feed-forward ANN from the HPT ϕ. with 784, 30, 20, and 10 neurons in each respective layer. Since neurons in this network are activated with a sigmoid It can be concluded that the overall performance of stan- function, the 784 greyscale pixel values of each image are dard SGD is better after 20 epochs than the overall perfor- downscaled to values between 0 and 1, and form the input mance of the HPT method. Perhaps more epochs will even 0 0 0 the performance between standard SGD and the HPT layer a1,a2,...,a784 of the network. The output layer con- tains ten neurons, since there are ten classes (10 different method. digits). 
The weights and biases of the network are initialized as random variables drawn from a standard normal distribution. The loss is calculated using a mean squared error function, but for backpropagation equation (6) is used. The network is trained for 20 epochs using mini-batch stochastic gradient descent with a batch size of 28 training examples and a learning rate of 0.5.

In Figure 4, the results of the HPT-Competitive method are shown. For larger values of c in equation (24), the HPT ϕ gains a larger influence over w_new, but this is observed to slow down learning.

In Figure 5, the results of the HPT-Imply method are shown. Again, larger values of c are observed to slow down learning. The slowdown of learning in the HPT-Imply method even shows to be far more susceptible to an increase of c than the slowdown of the HPT-Competitive method.

9.2 Biologically plausible synaptic behaviour
From the results, it can be observed that the HPT method trains more slowly than standard SGD. However, the synaptic change of a (randomly chosen¹) connection between two neurons in the network behaves more biologically naturally when using the HPT method (Competitive), as can be seen by comparing Figures 6a and 6b. Both graphs show a scatter plot, where each dot represents the value of an observed weight update of a connection between two neurons (pre- and post-synaptic) in the ANN after training for 5 epochs. The red dots show the positive weight updates and the blue dots the negative weight updates. In other words, a red dot represents a connection reinforcement and a blue dot represents a connection weakening.

In Figure 6a, there is no clear relation visible between the activation of the pre-synaptic neuron, the activation of the post-synaptic neuron, and the synaptic change (the observed weight update).
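The bookkeeping behind these scatter plots can be sketched in Python. The `record` helper and the sample numbers below are purely illustrative, not taken from the project's code; during training, the pre-activation, post-activation, and weight update of the tracked connection would be logged after every update:

```python
# Collect (pre-activation, post-activation, weight update) triples for one
# tracked connection; positive updates are plotted red (reinforcement),
# negative updates blue (weakening).
updates = []

def record(a_pre, a_post, delta_w):
    colour = "red" if delta_w > 0 else "blue"
    updates.append((a_pre, a_post, delta_w, colour))

# Hypothetical observations for the tracked connection:
record(0.9, 0.8, +0.02)   # related activations: connection reinforced
record(0.7, 0.1, -0.01)   # unrelated activations: connection weakened
print([u[3] for u in updates])  # ['red', 'blue']
```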

The results are summarized in Table 3 for different values of c. For c = 0, the HPT ϕ = 0 and the new weight equals the standard weight: w_new = w^l_ji + 0. Therefore, standard SGD is applied when c = 0, because there is no influence from the HPT ϕ.

¹To generate these results, the activation of the first neuron of the second layer and the activation of the first neuron of the third layer were chosen, as well as the weight updates of the connection between those neurons.
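How c gates the Hebbian contribution can be illustrated with a toy update. The function below is only a sketch, not the project's code: `hebbian_term` stands in for the competitive or imply rule of equations (24)/(25), and the way it is combined with the SGD step is deliberately simplified. It shows only the property discussed above, that c = 0 collapses the rule to plain SGD:

```python
def hpt_weight_update(w, sgd_step, hebbian_term, c):
    """Sketch of the HPT idea: the standard SGD update plus the HPT phi.

    `hebbian_term` is a stand-in for the competitive or imply rule of
    equations (24)/(25), which are not reproduced here; phi is that term
    scaled by c, so for c = 0 the update reduces to standard SGD.
    """
    phi = c * hebbian_term          # c gates the Hebbian plasticity term
    return (w - sgd_step) + phi     # simplified combination, for illustration

print(hpt_weight_update(w=1.0, sgd_step=0.25, hebbian_term=0.5, c=0.0))  # 0.75, identical to the plain SGD step
print(hpt_weight_update(w=1.0, sgd_step=0.25, hebbian_term=0.5, c=0.5))  # 1.0, phi pulls the weight further
```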

Figure 6. Synaptic change of the connection between two neurons in the ANN: (a) using standard SGD, (b) using the HPT method.

However, in Figure 6b, a pattern is observed. There are two distinctive groups: the group circled in red and the group circled in blue. The group circled in blue contains mostly blue dots, and the post-synaptic activations are low whatever the activation of the pre-synaptic neuron: no influence is observed from the pre- on the post-synaptic activation. So in most of the cases, the synaptic connection weakens if the pre- and post-activations are unrelated. The group circled in red contains mostly red dots. Also, some kind of linear dependence between the pre-synaptic activation and the post-synaptic activation can be recognized visually, observed by the diagonal shape of the red group in the 'pre' and 'post' dimensions of Figure 6b. This means that in most of the cases the synaptic connection strengthens if the pre- and post-activations are related. This is a desired behaviour from a Hebbian (and therefore biological) point of view, especially relating to its postulate "Cells that fire together, wire together".

These conclusions are based solely on visual observation of the graphs in Figure 6, since no statistical significance test has been performed on these results. However, they do provide an intuition as to why the HPT method causes synapses to behave more biologically plausibly.

10. CONCLUSIONS
In order to close the gap between artificial and biological neural networks, this paper aimed to incorporate mathematical translations of Hebbian theory into SGD. Just like the authors of [3] found, a limitation of Hebbian theory is that it only describes conditions for connection strengthening. Therefore, conditions should be added that allow for weight decay. Consequently, two variants of a quantification of Hebbian theory are proposed: the competitive Hebbian learning rule, equation (16), and the imply Hebbian learning rule, equation (17). Even though they are more biologically plausible, both variants do not show good learning behaviour on their own.

The HPT method that is proposed in this paper modifies the standard training algorithm SGD that is commonly used for ANNs. Based on the proposed variants of Hebbian learning, two variants of the HPT method are proposed: HPT-Competitive and HPT-Imply. By including the HPT ϕ in the feedforward equation (1), both the feedforward and the backpropagation are shown to be influenced by Hebbian learning. Because of this, not only the behaviour but also the way connections evolve (plasticity) is affected by Hebbian learning. This influence can be extended and controlled to a certain degree by the parameter c, as shown in equations (24) and (25).

Increasing the parameter c of the HPT method is observed to slow down learning. However, the performance of the HPT-Competitive method is less susceptible to an increase of c than that of the HPT-Imply method, because the HPT-Competitive method shows better performance than the HPT-Imply method for similar values of c. From this it can be concluded that the type of Hebbian learning rule used in the HPT method has a major influence on the performance of the ANN.

The HPT method makes connections behave and evolve more biologically realistically, at a slight loss in performance. When the activations of the pre- and post-synaptic neurons are visually observed to be related, the connection between those neurons is reinforced more often than weakened. When the activations of the pre- and post-synaptic neurons are visually observed to be unrelated, the connection is weakened more often than reinforced. Hence, there is a trade-off between performance and biologically realistic behaviour in the deep feedforward ANN that was used.

11. FUTURE WORK
As explored in [12], there are many more variants of Hebbian learning, and mathematical descriptions can vary endlessly. Future research could compare multiple different Hebbian learning rules and their effect on the performance of the HPT method, since the rules show to be varyingly sensitive to the size of the parameter c.

Also, avg_acts from section 8 is currently calculated based on an unevenly weighted moving average, where the most recent activations have the most influence. It would be interesting to see an implementation that compares the results of other types of moving averages, such as one that is evenly weighted or exponentially weighted.

12. ACKNOWLEDGMENTS
I want to thank my supervisor Elena Mocanu, who guided me during my research, provided me with much helpful and extensive feedback, and always wanted to get the best out of me.
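As a starting point for such a comparison, the three moving-average schemes mentioned in the future-work discussion can be sketched side by side. The function names and the smoothing factor `alpha` are illustrative choices, not part of the project:

```python
def halving_average(history, new):
    """The current scheme: history and newest value weighted equally."""
    return (history + new) / 2

def even_average(history, new, n):
    """Evenly weighted: all n observations so far count the same (running mean)."""
    return history + (new - history) / n

def exponential_average(history, new, alpha=0.1):
    """Exponentially weighted with smoothing factor alpha."""
    return alpha * new + (1 - alpha) * history

# Feed the same sequence into each scheme: a single high activation
# followed by zeros, to see how quickly the spike is forgotten.
values = [1.0, 0.0, 0.0, 0.0]
m_half = m_even = m_exp = values[0]
for n, v in enumerate(values[1:], start=2):
    m_half = halving_average(m_half, v)
    m_even = even_average(m_even, v, n)
    m_exp = exponential_average(m_exp, v)
print(round(m_half, 4), round(m_even, 4), round(m_exp, 4))  # 0.125 0.25 0.729
```

The halving scheme forgets the spike fastest, the running mean slower, and a mildly smoothed exponential average slowest, which is exactly the kind of difference such a comparison would measure.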

13. REFERENCES
[1] 3BLUE1BROWN. Backpropagation calculus | Deep learning, chapter 4 - YouTube, 11 2017.
[2] Yoshua Bengio, Thomas Mesnard, Asja Fischer, Saizheng Zhang, and Yuhuai Wu. STDP-compatible approximation of backpropagation in an energy-based model. Neural Computation, 29(3):555–577, 3 2017.
[3] Elie L. Bienenstock, Leon N. Cooper, and Paul W. Munro. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2(1):32–48, 1982.
[4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer (India) Private Limited, New York, 2006.
[5] Rodney Brooks, Demis Hassabis, Dennis Bray, and Amnon Shashua. Turing centenary: Is the brain a good model for machine intelligence? Nature, 482(7386):462–463, 2 2012.
[6] Natalia Caporale and Yang Dan. Spike timing-dependent plasticity: A Hebbian learning rule. Annual Review of Neuroscience, 31(1):25–46, 7 2008.
[7] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167, 2008.
[8] Haskell B. Curry. The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics, 2(3):258–261, 10 1944.
[9] Li Deng, Geoffrey Hinton, and Brian Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pages 8599–8603, 10 2013.
[10] Li Deng and Dong Yu. Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3-4):197–387, 2014.
[11] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):677–691, 4 2017.
[12] Wulfram Gerstner and Werner M. Kistler. Mathematical formulations of Hebbian learning. Biological Cybernetics, 87(5-6):404–415, 2002.
[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[14] Jordan Guerguiev, Timothy P. Lillicrap, and Blake A. Richards. Towards deep learning with segregated dendrites. eLife, 6, 12 2017.
[15] Prashanth Gurunath Shivakumar and Panayiotis Georgiou. Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations. Computer Speech and Language, 63, 9 2020.
[16] Donald Olding Hebb. The Organization of Behavior. Wiley & Sons, New York, 1949.
[17] Eric R. Kandel. Principles of Neural Science. McGraw-Hill Education, 5 edition, 2012.
[18] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 5 2015.
[19] Yann LeCun, Corinna Cortes, and Chris Burges. MNIST handwritten digit database.
[20] Timothy P. Lillicrap, Adam Santoro, Luke Marris, Colin J. Akerman, and Geoffrey Hinton. Backpropagation and the brain. Nature Reviews Neuroscience, pages 1–12, 4 2020.
[21] Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad E. Alsaadi. A survey of deep neural network architectures and their applications. Neurocomputing, 234:11–26, 4 2017.
[22] Siegrid Löwel and Wolf Singer. Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255(5041):209–212, 1 1992.
[23] Hesham Mostafa, Vishwajith Ramesh, and Gert Cauwenberghs. Deep supervised learning using local errors. Frontiers in Neuroscience, 12, 11 2017.
[24] Neuroscientifically Challenged. 2-Minute Neuroscience: Synaptic Transmission - YouTube, 7 2014.
[25] Michael A. Nielsen. Neural Networks and Deep Learning, 2015.
[26] Randall C. O'Reilly. Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5):895–938, 7 1996.
[27] Osmosis. Neuron action potential - physiology - YouTube, 12 2016.
[28] Sushant Patrikar. Batch, Mini Batch & Stochastic Gradient Descent - Towards Data Science, 10 2019.
[29] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. Technical report, 2018.
[30] Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11:24, 5 2017.
[31] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 1 2015.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015.
[33] Ioana Sporea and André Grüning. Supervised learning in multilayer spiking neural networks. Neural Computation, 25(2):473–509, 2013.
[34] Roland A. Suri. A computational framework for cortical learning. Biological Cybernetics, 90:400–409, 2004.
[35] George B. Thomas, Joel Hass, and Maurice D. Weir. Thomas' Calculus. Pearson Addison Wesley, 13 edition, 2 2014.
[36] James C. R. Whittington and Rafal Bogacz. Theories of error back-propagation in the brain. Trends in Cognitive Sciences, 23(3), 2019.
[37] James C. R. Whittington and Rafal Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5):1229–1262, 5 2017.
