Towards Biologically Plausible Deep Learning

Yoshua Bengio
March 12, 2015

NYU

Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, and Zhouhan Lin, arXiv 1502.04156

Central Issue in Deep Learning: Credit Assignment

• What should hidden layers do?

• Established approaches:
  • Backpropagation
  • Stochastic relaxation in Boltzmann machines

Issues with Back-Prop

• Over very deep nets or recurrent nets with many steps, non-linearities compose and yield sharp non-linearity → gradients vanish or explode
• Training deeper nets: harder optimization
• In the extreme of non-linearity: discrete functions, can't use back-prop
• Biological plausibility

Biological Plausibility Issues with Standard Backprop

1. BP of gradient = purely linear computation, not plausible across many neural levels
2. If feedback paths are used for BP, how would they know the precise derivatives of forward-prop?
3. Feedback paths would have to use exactly the same weights (transposed) as feedforward paths
4. Real neurons communicate via spikes
5. Need to clock and alternate feedforward and feedback computation
6. Where would the supervised targets come from?

Issues with Boltzmann Machines

• Sampling from the MCMC of the model is required in the inner loop of training
• As the model gets sharper, mixing between well-separated modes stalls

[Diagram: vicious circle between training updates and mixing.]

What is the brain's learning algorithm?
Cue: Spike-Timing Dependent Plasticity

• Observed throughout the nervous system, especially in cortex
• STDP: weight increases if post-spike comes just after pre-spike, decreases if just before.

Machine Learning Interpretation of Spike-Timing Dependent Plasticity

[Slide background: pages 1–2 of the paper — the introduction and Sec. 2, "STDP as Stochastic Gradient Descent", which argue that STDP could be seen as stochastic gradient descent if the neuron were driven by a feedback signal that changes its firing rate in proportion to the gradient of an objective function with respect to the neuron's voltage potential.]

• Suggested by Xie & Seung's NIPS'99 and Hinton 2007: the STDP update corresponds to a temporal derivative filter applied to post-spike, around pre-spike
• In agreement with the above, we argue this can be interpreted as follows:

$$\Delta W_{ij} \propto S_i \, \Delta V_j \quad (1)$$

(synaptic change ∝ pre-spike × temporal change in post-potential), where $\Delta$ indicates the temporal change, $S_i$ the pre-synaptic spike (from neuron $i$), and $V_j$ the post-synaptic voltage potential (of neuron $j$).

• This would be SGD on objective $J$ if

$$\Delta V_j \approx \frac{\partial J}{\partial V_j}$$

• This corresponds to neural dynamics implementing a form of inference wrt $J$, seen as a function of parameters and latent vars
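A minimal numerical sanity check of this reading (not from the paper; the linear neuron and toy objective are hypothetical): when feedback drives $\Delta V_j \propto \partial J / \partial V_j$, the STDP-style update $\Delta W_{ij} \propto S_i \Delta V_j$ coincides with the gradient of $J$ with respect to the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear neuron: V_j = sum_i W_ij S_i, objective J = -0.5 (V_j - target)^2.
# (Hypothetical setup, just to check the algebra of the interpretation.)
n_pre = 5
W = rng.normal(size=n_pre)                       # weights W_ij into post-synaptic neuron j
S = rng.binomial(1, 0.5, n_pre).astype(float)    # pre-synaptic spikes S_i
target = 1.0

def J(W):
    V = W @ S
    return -0.5 * (V - target) ** 2

# Feedback signal assumed to drive Delta V_j in proportion to dJ/dV_j:
V = W @ S
dV = -(V - target)                               # = dJ/dV_j for this toy J

# STDP-style update (Eq. 1): Delta W_ij  proportional to  S_i * Delta V_j
dW_stdp = S * dV

# Reference: numerical gradient of J with respect to W
eps = 1e-6
dW_num = np.array([(J(W + eps * np.eye(n_pre)[i]) - J(W - eps * np.eye(n_pre)[i]))
                   / (2 * eps) for i in range(n_pre)])

print(np.allclose(dW_stdp, dW_num, atol=1e-5))   # True: STDP update == SGD on J
```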

STDP and Variational EM

[Slide background: page 3 of the paper — Sec. 3, "Variational EM with Learned Approximate Inference", and the beginning of Sec. 4, "Training a Deep Generative Model". Key equations: the observed data log-likelihood decomposes as

$$\log p(x) = \mathbb{E}_{q^*(H|x)}[\log p(x, H)] + \mathbb{H}[q^*(H|x)] + KL\big(q^*(H|x)\,\|\,p(H|x)\big), \quad (2)$$

which, since entropy and KL divergence are non-negative, gives the variational bound

$$\log p(x) \ge \mathbb{E}_{q^*(H|x)}[\log p(x, H)] + \mathbb{H}[q^*(H|x)], \quad (3)$$

and the regularized variational MAP-EM criterion

$$J = \log p(x, h) + \alpha \log q(h \mid x), \quad (6)$$

where $h$ is a free variable (for each $x$) initialized from $q(H|x)$ and then iteratively updated to approximately maximize $J$.]

• Neural dynamics moving towards an "improved" objective $J$ and parameter updates towards the same $J$ correspond to a variational EM learning algorithm, with approximate inference
• where $J$ = regularized joint likelihood of observed $x$ and latent $h$, i.e., $J = \log p(x, h)$ + regularizer

$\log p(x, h)$: generative model. $q(h \mid x)$: inference initial guess — all interactions between neurons (forward pass).

• Generalizes PSD (Predictive Sparse Decomposition) from (Kavukcuoglu & LeCun 2008), with regularizer $= \alpha \log q(h \mid x)$

What Inference Mechanism?

• Simply following J's gradient corresponds to MAP inference (disadvantage: decoder not sufficiently contractive)

• Injecting noise in the process gives a form of approximate posterior MCMC, such as Langevin MCMC:

$$\dot{h} = \frac{1}{2}\frac{\partial J}{\partial h} + \text{Brownian noise}$$

• Or, in discrete time:

$$h \leftarrow h + \frac{1}{2}\frac{\partial J}{\partial h} + \text{Normal}(0, 1)\ \text{noise}$$

* no rejection: biased samples, but ok, see (Welling & Teh ICML 2011)
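A minimal sketch of the discrete-time update on a toy objective (hypothetical; note that the standard Langevin discretization scales the noise by √ε for step size ε, which the slide's Normal(0, 1) notation leaves implicit):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy J(h): log-density of N(mu, sigma^2), so the Langevin chain should
# sample approximately from that Gaussian posterior.
mu, sigma = 2.0, 0.5
def dJ_dh(h):
    return -(h - mu) / sigma**2     # gradient of log N(h; mu, sigma^2)

eps = 0.01                          # step size
h = 0.0
samples = []
for t in range(20000):
    # h <- h + (eps/2) dJ/dh + sqrt(eps) N(0, 1)
    h = h + 0.5 * eps * dJ_dh(h) + np.sqrt(eps) * rng.normal()
    samples.append(h)

# No rejection step: slightly biased, cf. Welling & Teh (2011).
print(np.mean(samples[5000:]), np.std(samples[5000:]))   # roughly (2.0, 0.5)
```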

Inference Decouples Deep Net Layers

[Slide background: page 4 of the paper, shown twice across two slides. Recoverable content: Algorithm 1, the layered chain decomposition (Eq. 7), the per-layer update (Eq. 8), and Sec. 5, which observes that Algorithm 1 trains $p(x|h)$ and $q(h|x)$ to form complementary pairs of a denoising auto-encoder, with joint transition operator $(\tilde{x}, \tilde{h}) \sim \text{corrupt}(x, h)$, $h \sim q(h \mid \tilde{x})$, $x \sim p(x \mid \tilde{h})$ (Eq. 9).]

Algorithm 1 Variational MAP (or MCMC) SGD algorithm for gradually improving the agreement between the values of the latent variables $h$ and the observed data $x$. $q(h|x)$ is a learned parametric initialization for $h$, $p(h)$ is a parametric prior on the latent variables, and $p(x|h)$ specifies how to generate $x$ given $h$. Objective function $J$ is defined in Eq. 6. Learning rates $\lambda$ and $\epsilon$ respectively control the optimization of $h$ and of parameters $\theta$ (of both $q$ and $p$).

  Initialize $h \sim q(h|x)$
  for $t = 1$ to $T$ do
    $h \leftarrow h + \lambda \frac{\partial J}{\partial h}$  (optional: add noise for MCMC)
  end for
  $\theta \leftarrow \theta + \epsilon \frac{\partial J}{\partial \theta}$

With $h$ decomposed into multiple layers and the conditional independence structure of a directed graphical model structured as a chain, both for $p$ (going down) and for $q$ (going up):

$$p(x, h) = p(x \mid h^{(1)}) \left(\prod_{k=1}^{M-1} p(h^{(k)} \mid h^{(k+1)})\right) p(h^{(M)}), \qquad q(h \mid x) = q(h^{(1)} \mid x) \prod_{k=1}^{M-1} q(h^{(k+1)} \mid h^{(k)}). \quad (7)$$

Denoting $x = h^{(0)}$, the $h$ update consists in moves of the form

$$h^{(k)} \leftarrow h^{(k)} + \lambda \frac{\partial}{\partial h^{(k)}} \Big( \log\big(p(h^{(k-1)} \mid h^{(k)})\, p(h^{(k)} \mid h^{(k+1)})\big) + \alpha \log\big(q(h^{(k)} \mid h^{(k-1)})\, q(h^{(k+1)} \mid h^{(k)})\big) \Big). \quad (8)$$

• After inference, no need for back-prop because the joint over layers decouples the updates of the parameters from the different layers
• So $J$ could be of the form

$$J = \sum_k \log p(h^{(k)} \mid h^{(k+1)}) + \log q(h^{(k+1)} \mid h^{(k)})$$
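A compact sketch of this inference loop (not the paper's code; shapes, weights, and the linear-Gaussian conditionals are all hypothetical assumptions) for a chain x — h1 — h2, with the per-layer gradients of Eq. 8 written in closed form; the top-level prior p(h2) is omitted for brevity and only the h updates are shown, not the θ step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes (hypothetical): x in R^8, h1 in R^6, h2 in R^4.
dx, d1, d2 = 8, 6, 4
# All conditionals Gaussian with identity covariance:
P1 = rng.normal(scale=0.3, size=(dx, d1))   # mean of p(x  | h1) = P1 h1
P2 = rng.normal(scale=0.3, size=(d1, d2))   # mean of p(h1 | h2) = P2 h2
Q1 = rng.normal(scale=0.3, size=(d1, dx))   # mean of q(h1 | x)  = Q1 x
Q2 = rng.normal(scale=0.3, size=(d2, d1))   # mean of q(h2 | h1) = Q2 h1

x = rng.normal(size=dx)
alpha, lam, T = 0.3, 0.1, 15

# Initialize h ~ q(h | x)  (here: at the conditional means)
h1, h2 = Q1 @ x, Q2 @ (Q1 @ x)
for t in range(T):
    # Eq. 8 for h1: feedback from below (x) and above (h2), plus q terms.
    g1 = (P1.T @ (x - P1 @ h1)                    # d/dh1 log p(x  | h1)
          - (h1 - P2 @ h2)                        # d/dh1 log p(h1 | h2)
          + alpha * (-(h1 - Q1 @ x)               # d/dh1 log q(h1 | x)
                     + Q2.T @ (h2 - Q2 @ h1)))    # d/dh1 log q(h2 | h1)
    g2 = (P2.T @ (h1 - P2 @ h2)                   # d/dh2 log p(h1 | h2)
          + alpha * (-(h2 - Q2 @ h1)))            # d/dh2 log q(h2 | h1)
    h1, h2 = h1 + lam * g1, h2 + lam * g2         # optional: add noise for MCMC

print("inferred h1, h2 norms:", np.linalg.norm(h1), np.linalg.norm(h2))
```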

But Inference Seems to Need Backprop

[Slide background: page 4 of the paper again — Algorithm 1, the layered decomposition above, and Eqs. 10–11: a middle layer $h^{(k)}$ can be reconstructed either from above, $h^{(k)} \sim p(h^{(k)} \mid \tilde{h}^{(k+1)})$, or from below, $h^{(k)} \sim q(h^{(k)} \mid \tilde{h}^{(k-1)})$.]

• Iterative inference, e.g. MAP:

$$h \leftarrow h + \lambda \frac{\partial J}{\partial h}$$

• Involves $\frac{\partial J}{\partial h}$, which has terms of the form

$$\frac{\partial \log p(h^{(k-1)} \mid h^{(k)})}{\partial h^{(k)}}$$

i.e., change the upper layer to make the lower layer's value more probable (or the equivalent for $q$)

How to back-prop through one layer without explicit derivatives?

DIFFERENCE TARGET-PROP

Result: iterative inference climbs J even though no gradients were ever computed and no animal was harmed!

[Plot: average log p(x, h) vs. MAP iteration (0 to 20), showing J improving over the course of iterative inference.]

Parenthesis about auto-encoders' probabilistic interpretation

Regularized Auto-Encoders Learn a Vector Field or a Markov Chain Transition Distribution

• (Bengio, Vincent & Courville, TPAMI 2013) review paper
• (Alain & Bengio ICLR 2013; Bengio et al, NIPS 2013)

Denoising Auto-Encoders Learn a Small Move Towards Higher Probability (Alain & Bengio ICLR 2013)

• Reconstruction $\hat{x}$ points in the direction of the gradient of higher probability:

$$\hat{x} - x \propto \frac{\partial \log P(x)}{\partial x}$$

• Trained with input/target pair = (corrupted $\tilde{x}$ → clean data $x$)

• DAE → score matching (Vincent 2011)

[Diagram: corruption $x \to \tilde{x}$ and reconstruction $\tilde{x} \to \hat{x}$.]

General Result about Denoising (Alain & Bengio ICLR 2013)

• Non-parametric limit:

$$r^* = \arg\min_r\; \mathbb{E}\big[\, \| x - r(x + \sigma z) \|^2 \,\big]$$

• where $z$ is $N(0, 1)$ noise and $\mathbb{E}[\cdot]$ is over $p(x)$ and $z$. Then

$$\frac{r^*(x) - x}{\sigma^2} = \frac{\partial \log p(x)}{\partial x}$$

• i.e., following the reconstruction moves up the gradient of log-probability (down the energy)
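The theorem can be checked directly in the one case where the optimal denoiser is known in closed form (a hypothetical toy, not from the slides): for $x \sim N(0,1)$ and corruption noise $N(0, \sigma^2)$, the Bayes-optimal reconstruction is the posterior mean $\tilde{x}/(1+\sigma^2)$, and $(r^*(x) - x)/\sigma^2$ should approach $\partial \log p(x)/\partial x = -x$:

```python
# Data x ~ N(0, 1), corruption x_tilde = x + sigma * z with z ~ N(0, 1).
# Bayes-optimal denoiser: E[x | x_tilde] = x_tilde / (1 + sigma^2).
# Theorem predicts (r*(x) - x) / sigma^2 -> d/dx log p(x) = -x as sigma -> 0.
for sigma in [1.0, 0.3, 0.1, 0.03]:
    x = 1.5                                   # evaluate the estimator at x = 1.5
    r_star = x / (1 + sigma**2)               # optimal reconstruction at this point
    score_est = (r_star - x) / sigma**2       # DAE-based score estimate
    print(f"sigma={sigma:5.2f}  estimate={score_est:+.4f}  true={-x:+.4f}")
```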

Consistency Results (Bengio et al NIPS 2013)

• Denoising AEs are consistent estimators of the data-generating distribution through their Markov chain (corrupt, reconstruct and inject reconstruction error noise, repeat), so long as they consistently estimate the conditional denoising distribution and the Markov chain converges.

Making $P_{\theta_n}(X \mid \tilde{X})$ (denoising distr.) match $\mathcal{P}(X \mid \tilde{X})$ (truth) makes $\pi_n(X)$ (stationary distr.) match $\mathcal{P}(X)$ (truth).

• In other words, if the inference mechanism corresponds to corruption and denoising reconstruction, we are following the model's Markov chain.

Denoising Score Matching

• An alternative to maximum likelihood for continuous random variables
• Asymptotically consistent estimator (as the noise level $\sigma$ decreases and # examples increases)
• Reconstruction:

$$r(x) = x - \sigma^2 \frac{\partial\, \text{Energy}(x)}{\partial x}$$

• Denoising training objective, with $N(0,1)$ noise $z$:

$$\mathbb{E}_{x,z}\big[\, \| r(x + \sigma z) - x \|^2 \,\big]$$

→ No partition function gradient!
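A minimal sketch of training with this objective (hypothetical 1-D energy model, not from the slides): with Energy(x) = ½ a x² we get r(x) = (1 − σ²a) x, and SGD on the denoising loss drives a towards 1/(Var(x) + σ²) — the true precision as σ → 0 — without ever touching a partition function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D energy model: Energy(x) = 0.5 * a * x^2, so
# r(x) = x - sigma^2 * dEnergy/dx = (1 - sigma^2 * a) * x.
true_prec = 4.0                        # data ~ N(0, 1/true_prec)
sigma, lr, a = 0.1, 5.0, 1.0

for step in range(20000):
    x = rng.normal(scale=1 / np.sqrt(true_prec))
    z = rng.normal()
    xt = x + sigma * z                 # corrupted input
    r = xt - sigma**2 * a * xt         # reconstruction from the energy gradient
    # loss = (r - x)^2 ;  d loss / d a = 2 (r - x) * (-sigma^2 * xt)
    a -= lr * 2 * (r - x) * (-sigma**2 * xt)

# The objective's minimizer is a = 1 / (Var(x) + sigma^2) -> true_prec as sigma -> 0.
print(a, 1 / (1 / true_prec + sigma**2))
```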

Extracting Structure By Gradual Disentangling and Manifold Unfolding (Bengio 2014, arXiv 1407.7906)

Each level transforms the data (signal) into a representation in which it is easier to model, unfolding it more, contracting the noise dimensions and mapping the signal dimensions to a factorized (uniform-like) distribution.

$$\min\; KL\big(Q(x, h) \,\|\, P(x, h)\big)$$

= variational auto-encoder criterion (Kingma & Welling ICLR 2014)

[Diagram: stack of encoders $f_1, \ldots, f_L$ and decoders $g_1, \ldots, g_L$ relating $Q(h_k \mid h_{k-1})$ and $P(h_{k-1} \mid h_k)$, from data $Q(x)$ at the bottom to noise $P(h_L)$ vs. $Q(h_L)$ at the top.]

Close parenthesis
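A short derivation of the equivalence claimed on the slide above (a standard identity, added here for clarity): expanding the joint KL shows it equals the (constant) data entropy plus the expected negative variational bound of Eq. 3, so minimizing it over the model and encoder is exactly the variational auto-encoder criterion.

```latex
\begin{aligned}
KL\big(Q(x,h)\,\|\,P(x,h)\big)
&= \mathbb{E}_{Q(x)\,Q(h|x)}\!\left[\log \frac{Q(x)\,Q(h|x)}{P(x,h)}\right] \\
&= -\,\mathbb{H}[Q(x)]
   \;-\; \mathbb{E}_{Q(x)}\Big[\underbrace{\mathbb{E}_{Q(h|x)}[\log P(x,h)] + \mathbb{H}[Q(h|x)]}_{\text{variational bound on } \log P(x),\ \text{cf. Eq. 3}}\Big].
\end{aligned}
```

Since $\mathbb{H}[Q(x)]$ is fixed by the data, minimizing the left-hand side maximizes the expected variational lower bound.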

Difference Target-Prop Estimator

[Slide background: page 5 of the paper — Sec. 5.1–5.2 on unfolding the Markov chain of an MCMC-based generative model into a deep directed model and on latent variables as corruption, plus the beginning of Sec. 6, "Targetprop instead of Backprop", which invokes Theorem 2 of Alain & Bengio (2013): a well-trained denoising auto-encoder with reconstruction function $r(x) = \text{decode}(\text{encode}(x))$ estimates the log-score via $(r(x) - x)/\sigma^2 \to \partial \log p(x)/\partial x$.]

• If the encoder is $f(x) + \text{noise}$ and the decoder is $g(h) + \text{noise}$, then

$$\widehat{\Delta h} = \frac{f(x) - f(g(h))}{\sigma_h^2}, \qquad \frac{\partial \log p(x \mid h)}{\partial h} \approx \frac{f(x) - f(g(h))}{\sigma_h^2}, \quad (12)$$

where $\sigma_h^2$ is the variance of the noise injected in $q(h \mid x)$.

• which is demonstrated by exploiting $\log p(x \mid h) = \log p(x, h) - \log p(h)$ and the DAE score estimator theorem
• Considering two DAEs, one with $(x, h)$ as "visible" and one with $h$ as "visible":

Decomposition of the gradient into reconstructions

• We want

$$\frac{\partial \log p(x \mid h)}{\partial h} = \frac{\partial \log p(x, h)}{\partial h} - \frac{\partial \log p(h)}{\partial h}$$

• which we get from two auto-encoders:
1. The $(x, h)$ to $(h, x)$ AE, whose (noise-free) reconstruction function is $r(x, h) = (g(h), f(x))$: $\;\frac{f(x) - h}{\sigma_h^2} \approx \frac{\partial \log p(x, h)}{\partial h}$
2. The AE with $h$ as « visible » and $x$ as « representation », for which $g$ is the encoder (with $g(h)$ the "code" for $h$) and $f$ is the decoder: $\;\frac{f(g(h)) - h}{\sigma_h^2} \approx \frac{\partial \log p(h)}{\partial h}$
• Result:

$$\frac{\partial \log p(x \mid h)}{\partial h} \approx \frac{f(x) - h}{\sigma_h^2} - \frac{f(g(h)) - h}{\sigma_h^2} = \frac{f(x) - f(g(h))}{\sigma_h^2},$$

which corresponds to Eq. 12.

[Slide background: page 6 of the paper — Figure 1 (a geometric derivation of Eq. 12: the optimal $h$ for maximizing $p(x \mid h)$ is $\tilde{h}$ s.t. $g(\tilde{h}) = x$; since encoder $f$ and decoder $g$ are approximate inverses, their composition makes a small move $\Delta$) and Sec. 7, Related Work (Hinton 2007; Xie & Seung 2003; PSD, Kavukcuoglu et al. 2008). Also recoverable: Algorithm 2, reproduced below.]

Algorithm 2 Inference, training and generative procedures used in Experiment 1, for a model with three layers $x$, $h_1$, $h_2$. $f_i()$ is the feedforward map from layer $i-1$ to layer $i$ and $g_i()$ is the feedback map from layer $i$ to layer $i-1$, with $x = h_0$ being layer 0.

  Define INFERENCE($x$, $N$=15, $\lambda$=0.1, $\alpha$=0.001):
    Feedforward pass: $h_1 \leftarrow f_1(x)$, $h_2 \leftarrow f_2(h_1)$
    for $t = 1$ to $N$ do
      $h_2 \leftarrow h_2 + \lambda (f_2(h_1) - f_2(g_2(h_2)))$
      $h_1 \leftarrow h_1 + \lambda (f_1(x) - f_1(g_1(h_1))) + \alpha (g_2(h_2) - h_1)$
    end for
    Return $h_1$, $h_2$

  Define TRAIN():
    for $x$ in training set do
      do INFERENCE($x$)
      train each layer (both $f_l$ and $g_l$) by taking the Gaussian-corrupted value of the other layer as input and the clean inferred value as target (i.e. applying the delta rule). For the top sigmoid layer, we sample 3 binary values and average them as a spike-like corruption.
    end for
    Compute the mean and variance of the $h_2$ values inferred in the training set. Multiply the variances by 4. Define $p(h_2)$ as sampling from this Gaussian.

  Define GENERATE():
    Sample $h_2$ from $p(H_2)$
    for $t = 1$ to 3 do
      $h_1, h_2 \leftarrow$ INFERENCE($x$, $\alpha$=0.3)
      $x \leftarrow g_1(h_1)$
    end for
    Return $x$

Same Formula justifies a Backprop-free Auto-Encoder based on Target-Prop

• If $r(x) = g(f(x))$ is smooth and makes a small move $\Delta$ away from $x$, then applying $r$ from

$$\tilde{x} = x - \Delta x = x - (g(f(x)) - x) = 2x - g(f(x))$$

• should approximately give $x$, so $g(\tilde{h}) \approx x$
• where $\tilde{h} = f(\tilde{x}) = f(2x - g(f(x)))$
• And the encoder should be trained on the pair $(\tilde{x}, \tilde{h})$
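A small numerical check of Eq. 12 (a hypothetical idealized setting: tied linear encoder/decoder that are exact inverses of each other and matching noise variances, under which the approximation becomes exact):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tied linear encoder/decoder that are exact inverses:
# f(x) = F x, g(h) = G h with F = G^T and G column-orthonormal (F G = I).
dx, dh = 8, 3
G, _ = np.linalg.qr(rng.normal(size=(dx, dh)))   # decoder mean g(h) = G h
F = G.T                                          # encoder mean f(x) = F x
sigma_h = 0.1                                    # noise std in q(h | x)

x = rng.normal(size=dx)
h = F @ x + 0.2 * rng.normal(size=dh)            # some current value of h

# Targetprop estimator (Eq. 12): (f(x) - f(g(h))) / sigma_h^2
grad_est = (F @ x - F @ (G @ h)) / sigma_h**2

# For p(x|h) = N(G h, sigma_h^2 I), the exact gradient of log p(x|h) wrt h is
# G^T (x - G h) / sigma_h^2 -- identical here because F = G^T and F G = I.
grad_exact = G.T @ (x - G @ h) / sigma_h**2
print(np.allclose(grad_est, grad_exact))         # True in this idealized case
```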

Difference Target-Prop for Inexact Inverse (Difference Target Propagation)

• Make a correction that guarantees, to first order, that the projected estimated target is closer to the correct target than the original value:

$$\hat{h}_{i-1} = h_{i-1} - g_i(h_i) + g_i(\hat{h}_i)$$

Then

$$f_i(\hat{h}_{i-1}) = f_i\big(h_{i-1} - g_i(h_i) + g_i(\hat{h}_i)\big) \approx f_i\big(h_{i-1} + g_i'(h_i)(\hat{h}_i - h_i)\big) \quad \text{if } \hat{h}_i \approx h_i$$
$$\approx f_i(h_{i-1}) + f_i'(h_{i-1})\, g_i'(h_i)\,(\hat{h}_i - h_i)$$

so

$$\hat{h}_i - f_i(\hat{h}_{i-1}) \approx \big[I - f_i'(h_{i-1})\, g_i'(h_i)\big]\,(\hat{h}_i - h_i)$$

and

$$\|\hat{h}_i - f_i(\hat{h}_{i-1})\|^2 < \|\hat{h}_i - h_i\|^2$$

if $1 >$ max eigenvalue of $\big(I - f_i'(h_{i-1})\, g_i'(h_i)\big)\big(I - f_i'(h_{i-1})\, g_i'(h_i)\big)^T$.

• Special case: feedback alignment, if $g_i(h) = B h$
• $g$ doesn't need to be the inverse mapping!
• But we can get the exact target if $f_i(g_i(\hat{h}_i)) = \hat{h}_i$
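A numerical sketch of this correction (hypothetical maps: f is a tanh layer and g an imperfect inverse built from perturbed weights), showing the difference-corrected target projecting much closer to the layer target than plain targetprop when g is not an exact inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4
W = rng.normal(scale=0.8, size=(n, n))
V = W + 0.1 * rng.normal(size=(n, n))        # perturbed weights: g is only approximate

def f(h):                                    # feedforward map f_i
    return np.tanh(W @ h)

def g(h):                                    # feedback map g_i, an imperfect inverse of f_i
    return np.linalg.solve(V, np.arctanh(np.clip(h, -0.99, 0.99)))

h_prev = 0.3 * rng.normal(size=n)            # h_{i-1}
h_i = f(h_prev)                              # current activation h_i
h_i_hat = h_i + 0.01 * rng.normal(size=n)    # slightly moved target for layer i

naive = g(h_i_hat)                           # plain targetprop target for layer i-1
diff = h_prev - g(h_i) + g(h_i_hat)          # difference targetprop correction

print("naive:", np.linalg.norm(f(naive) - h_i_hat))   # stuck at the inverse mismatch
print("diff :", np.linalg.norm(f(diff) - h_i_hat))    # much smaller residual
```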

[Plot panels: "Hyper-optimizing for validation error" and "Hyper-optimizing for training error".]

• We used hyper-parameters for the best valid error and training error respectively
• Test error: 1.73% (target prop with high regression α = 0.99), 1.62% (difference target prop), 1.44% (back-prop), with respective learning rates

• Left graph: hyper-parameters for the best valid error

• Right graph: hyper-parameters for the best training cost at 100 epochs

• Target prop is sometimes faster than back-prop; though it usually overfits, it can solve under-fitting problems (e.g., very deep nets, highly non-linear nets and discrete nets)

Targetprop can work for discrete and/or stochastic activations

Experimental Result (work in progress)


• We used hyper-parameters for the best valid error

• Test error: ~2.5% (discrete networks with 3 hidden layers), ~2.5% (discrete networks with 2 hidden layers), 5–6% (just training the top classifier with 2 hidden layers: back-prop)

Iterated Target-Prop Generative Deep Learning Experiments on MNIST

[Figure panels: original examples; inpainting starting point; generated model samples.]

[Figure panels: inpainting missing values (starting from noise); inpainted results.]

What's Next?

• Experiments only involved p terms in J, but if there are going to be multiple modalities, we need correction signals (target prop) from above as well as from below

• Using true gradients instead of diff targetprop yielded better final values of J after each inference iteration but a worse final value of J after training. Why?

• Proposed theory suggests that using only a few inference iterations should give a sufficient signal to update weights, but experiments required 10–15.

• Updates in the paper did not follow the STDP framework but used final inference values as targets

Why Noise is Needed

• Up to now we used MAP inference in our experiments

• Adding noise appropriately makes it a biased Langevin MCMC, making the inference procedure approximately sample from the posterior of latent given visible

• Noise may be necessary to appropriately prepare the decoder to face the inadequacy of the higher levels' 'prior', by becoming contractive

• It comes up automatically in the variational auto-encoder criterion

The Importance of a Contractive Decoder

• Denoising → contractive g
• If f bijective: P(x) = P(h = f(x)) |det f'(x)|
• Maximizing the determinant of f' → f expansive at data x, g contractive around data
• Contraction → removes unnecessary directions

[Diagram: P(h) vs. Q(h) at the top of the encoder/decoder (f/g) stack over data Q(x).]

• Making g contractive helps to manage the mismatch between P(h) and Q(h)
• Adding noise at the top level in Q(h|x) shows the decoder which directions of h need to be contracted out, making it contractive

Many Probabilistic Interpretations, e.g. EM, Denoising Score Matching

• A reconstruction function (state → state) embodies the energy gradient (towards an improved state) and defines neural dynamics
• Use it for inference, e.g. Langevin MCMC, i.e., update the state towards the reconstruction, with some noise injected
• Given visible x, do inference to sample h ~ posterior given x
• Consider the state s = (x, h) as if it were visible and perform a denoising score matching update of parameters θ, i.e.,

$$\min_\theta \| \text{reconstruct}(\text{corrupt}(\text{state})) - \text{state} \|^2$$

• Any energy function can be defined, but some give rise to biologically plausible neural dynamics


• Instead of waiting for the last step of inference (to be used as target à la EM), we can ask each inference step to land where the next step will land, i.e., to speed up the MCMC burn-in
• i.e., target state = later state in the chain; corrupted state = noisy, earlier state in the chain; reconstruction error becomes PREDICTION error
• This would result in an STDP-like update, at every time step, not just at the end of inference

[Diagram: chain $S_0 \to S_1 \to S_2 \to S_3$ under inference operator $A$; $S_2$ is a target for the output of $A$ applied to $S_0$, i.e., $A$ wants to become $A^2$.]
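A sketch of how such training pairs might be assembled (the operator A, shapes, and noise levels are all hypothetical): each pair asks a single application of A, from a noisy earlier state, to predict the state two steps later, turning reconstruction error into prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inference chain: states S0 -> S1 -> ... produced by operator A.
def A(s, theta):
    return np.tanh(theta @ s) + 0.05 * rng.normal(size=s.shape)

theta = rng.normal(scale=0.5, size=(6, 6))
chain = [rng.normal(size=6)]
for t in range(3):
    chain.append(A(chain[-1], theta))        # S0, S1, S2, S3

# "Impatient" pairs: input = noisy earlier state, target = state two steps later,
# so one application of A is trained to land where two would (A wants to become A^2).
pairs = [(chain[t] + 0.05 * rng.normal(size=6), chain[t + 2])
         for t in range(len(chain) - 2)]
for inp, tgt in pairs:
    pred_err = np.linalg.norm(A(inp, theta) - tgt)   # error to reduce at every step
    print(pred_err)
```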

MILA: Montreal Institute for Learning Algorithms