The Helmholtz Machine Revisited, EPFL 2012


The Helmholtz machine revisited
Danilo Jimenez Rezende, BMI-EPFL
November 8, 2012

Outline
1. Introduction
2. Variational Approximation
3. Relevant special cases
4. Learning with non-factorized q's
5. Extending the model over time
6. Final remarks

Introduction: Helmholtz machines

Helmholtz machines [Dayan et al., 1995, Dayan and Hinton, 1996, Dayan, 2000] are directed graphical models with a layered structure (figure: top-down chain $X_L \to \dots \to X_1 \to X_0$).

Complete data likelihood:
$$p(X \mid \theta^g) = \prod_{l=0}^{L-1} p(X_l \mid X_{l+1}; \theta^g)\, p(X_L),$$
where only $X_0$ is observed.

Types of Helmholtz machines

Binary units:
$$p(X_l \mid X_{l+1}) = \mathrm{Bern} \circ \mathrm{sigmoid}(W_l^g X_{l+1} + B_l^g),$$
with parameters $\theta^g = \{W^g, B^g\}$.

Smooth units:
$$p(X_l \mid X_{l+1}) = \mathcal{N}\!\left(X_l;\ \tanh(W_l^g X_{l+1} + B_l^g),\ \Sigma_l^g\right),$$
with parameters $\theta^g = \{W^g, B^g, \Sigma^g\}$.
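To make the generative direction concrete, here is a minimal NumPy sketch (not from the slides) of ancestral sampling from a binary-unit Helmholtz machine. The function name `sample_generative`, the layer sizes, and the random initialization of $W^g$ and $B^g$ are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_generative(W_g, B_g, p_top, rng):
    """Ancestral sampling from a binary-unit Helmholtz machine.

    W_g, B_g : lists of generative weights/biases ordered top-down, so that
               layer l is drawn from Bern(sigmoid(W_l^g @ x_{l+1} + B_l^g)).
    p_top    : Bernoulli means for the top layer X_L.
    Returns the sampled layers [X_L, ..., X_0]; only X_0 is "observed".
    """
    x = (rng.random(p_top.shape) < p_top).astype(float)   # sample X_L
    layers = [x]
    for W, b in zip(W_g, B_g):                             # top-down: X_{l+1} -> X_l
        p = sigmoid(W @ x + b)
        x = (rng.random(p.shape) < p).astype(float)
        layers.append(x)
    return layers

# Example: sizes 5 -> 10 -> 20 for X_2 -> X_1 -> X_0 (illustrative)
rng = np.random.default_rng(0)
sizes = [5, 10, 20]
W_g = [rng.normal(0, 0.1, (sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
B_g = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]
x_layers = sample_generative(W_g, B_g, p_top=np.full(sizes[0], 0.5), rng=rng)
print([x.shape for x in x_layers])   # [(5,), (10,), (20,)]
```

For smooth units, the Bernoulli draw would be replaced by a Gaussian with mean $\tanh(W_l^g X_{l+1} + B_l^g)$ and covariance $\Sigma_l^g$.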
Goal

For a data set of i.i.d. samples $y \in \mathrm{Data}$, maximize the data log-likelihood with respect to $\theta^g$:
$$\ln p(\mathrm{Data} \mid \theta^g) = \sum_{y \in \mathrm{Data}} \ln p(y \mid \theta^g),$$
where
$$p(y \mid \theta^g) = \int \prod_{l=1}^{L} dX_l \; p(y \mid X_1) \prod_{l=1}^{L-1} p(X_l \mid X_{l+1}; \theta^g)\, p(X_L).$$

The variational trick

Introduce a parametric family of distributions $q(X_{l>0} \mid X_0; \theta^r)$. Then
$$\ln p(X_0 \mid \theta^g) = -\underbrace{\big\langle \ln q(X_{l>0} \mid X_0; \theta^r) - \ln p(X) \big\rangle_{q(X_{l>0} \mid X_0, \theta^r)}}_{F(X_0,\, \theta^g,\, \theta^r)} + \mathrm{KL}(q, p),$$
where $p(X)$ denotes the generative joint $p(X_0, X_{l>0} \mid \theta^g)$ and
$$\mathrm{KL}(q, p) = \Big\langle \ln \frac{q(X_{l>0} \mid X_0; \theta^r)}{p(X_{l>0} \mid X_0; \theta^g)} \Big\rangle_{q(X_{l>0} \mid X_0, \theta^r)}.$$
Since $\mathrm{KL} \ge 0$, it follows that $\ln p(X_0 \mid \theta^g) \ge -F(X_0; \theta^g, \theta^r)$ for all $\theta^r$.

Redefine the learning problem as
$$\{\hat\theta^g, \hat\theta^r\} = \arg\min_{\theta^g, \theta^r} \sum_{x \in \mathrm{Data}} F(x; \theta^g, \theta^r).$$

Why?
- q can be any distribution with the same support as p.
- Choose q so that the required expectations can be computed in a reasonable time.
- Inference can then be solved with standard optimization techniques.
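As an illustration of this objective (a sketch under assumptions, not the speaker's code), the free energy $F$ for one data point can be estimated by simple Monte Carlo whenever one can sample from $q$ and evaluate the log-densities of $q$ and of the generative joint. The callables `sample_q`, `log_q`, and `log_p_joint` below are hypothetical interfaces.

```python
import numpy as np

def free_energy_mc(x0, sample_q, log_q, log_p_joint, n_samples=100, rng=None):
    """Monte Carlo estimate of the variational free energy
        F(x0, theta_g, theta_r) = E_q[ ln q(h | x0) - ln p(x0, h) ],
    whose negative lower-bounds ln p(x0 | theta_g).

    sample_q(x0, rng)  -> one sample h ~ q(. | x0)            (hypothetical)
    log_q(h, x0)       -> ln q(h | x0)                        (hypothetical)
    log_p_joint(x0, h) -> ln p(x0, h) under the generative model (hypothetical)
    """
    rng = rng or np.random.default_rng()
    vals = []
    for _ in range(n_samples):
        h = sample_q(x0, rng)
        vals.append(log_q(h, x0) - log_p_joint(x0, h))
    return float(np.mean(vals))
```

Averaging this estimate over the data set gives the quantity minimized in the learning problem above.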
Relevant special cases

Fully factorized q (mean field):
$$q(X_{l>0} \mid X_0; \theta^r) = \prod_{l>0,\, i} p(X_{l,i} \mid \theta^r_{l,i}).$$
- Expectations are analytically tractable for sigmoid/tanh nonlinearities [Frey, 1996, Jordan, 1999].
- Yields local, message-passing-type algorithms.
- Fast convergence.
- Poor approximation to multimodal posteriors.

The Helmholtz machine's choice of q

A bottom-up graph [Dayan et al., 1995, Dayan and Hinton, 1996, Dayan, 2000], conditioned on $X_0$:
$$q(X_{l>0} \mid X_0; \theta^r) = \prod_{l=1}^{L} p(X_l \mid X_{l-1}; \theta^r).$$
- No expectation can be computed analytically.
- Resort to Monte Carlo approximations (a sketch of the bottom-up sampler follows below).
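Here is a minimal sketch of that bottom-up recognition pass with binary units (assumed NumPy interface; the recognition weights `W_r` and biases `B_r` are illustrative). The hidden layers are sampled in a single upward sweep.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_recognition(x0, W_r, B_r, rng):
    """Bottom-up sampling from the recognition model
        q(X_{l>0} | X_0; theta_r) = prod_{l=1}^{L} p(X_l | X_{l-1}; theta_r),
    with binary units. W_r, B_r hold recognition weights/biases for layers
    1..L, ordered bottom-up. Returns the sampled hidden layers [X_1, ..., X_L]
    for one observed X_0."""
    x = x0
    hidden = []
    for W, b in zip(W_r, B_r):                  # bottom-up: X_{l-1} -> X_l
        p = sigmoid(W @ x + b)
        x = (rng.random(p.shape) < p).astype(float)
        hidden.append(x)
    return hidden
```

Because this q is a directed chain rather than a factorized distribution, its expectations are approximated with such samples, for example by feeding them into the Monte Carlo free-energy estimator sketched earlier.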