Boltzmann Machines, Backpropagation, Bayesian Networks
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Neural Networks for Machine Learning Lecture 12A The
Neural Networks for Machine Learning Lecture 12a The Boltzmann Machine learning algorithm Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed The goal of learning • We want to maximize the • It is also equivalent to product of the probabilities that maximizing the probability that the Boltzmann machine we would obtain exactly the N assigns to the binary vectors in training cases if we did the the training set. following – This is equivalent to – Let the network settle to its maximizing the sum of the stationary distribution N log probabilities that the different times with no Boltzmann machine external input. assigns to the training – Sample the visible vector vectors. once each time. Why the learning could be difficult Consider a chain of units with visible units at the ends w2 w3 w4 hidden w1 w5 visible If the training set consists of (1,0) and (0,1) we want the product of all the weights to be negative. So to know how to change w1 or w5 we must know w3. A very surprising fact • Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations. ∂log p(v) = s s − s s i j v i j model ∂wij Derivative of log Expected value of Expected value of probability of one product of states at product of states at training vector, v thermal equilibrium thermal equilibrium under the model. when v is clamped with no clamping on the visible units Δw ∝ s s − s s ij i j data i j model Why is the derivative so simple? • The energy is a linear function • The probability of a global of the weights and states, so: configuration at thermal equilibrium is an exponential ∂E function of its energy. -
A Boltzmann Machine Implementation for the D-Wave
A Boltzmann Machine Implementation for the D-Wave John E. Dorband Ph.D. Dept. of Computer Science and Electrical Engineering University of Maryland, Baltimore County Baltimore, Maryland 21250, USA [email protected] Abstract—The D-Wave is an adiabatic quantum computer. It is an understatement to say that it is not a traditional computer. It can be viewed as a computational accelerator or more precisely a computational oracle, where one asks it a relevant question and it returns a useful answer. The question is how do you ask a relevant question and how do you use the answer it returns. This paper addresses these issues in a way that is pertinent to machine learning. A Boltzmann machine is implemented with the D-Wave since the D-Wave is merely a hardware instantiation of a partially connected Boltzmann machine. This paper presents a prototype implementation of a 3-layered neural network where the D-Wave is used as the middle (hidden) layer of the neural network. This paper also explains how the D-Wave can be utilized in a multi-layer neural network (more than 3 layers) and one in which each layer may be multiple times the size of the D- Wave being used. Keywords-component; quantum computing; D-Wave; neural network; Boltzmann machine; chimera graph; MNIST; I. INTRODUCTION The D-Wave is an adiabatic quantum computer [1][2]. Its Figure 1. Nine cells of a chimera graph. Each cell contains 8 qubits. function is to determine a set of ones and zeros (qubits) that minimize the objective function, equation (1). -
Boltzmann Machine Learning with the Latent Maximum Entropy Principle
UAI2003 WANG ET AL. 567 Boltzmann Machine Learning with the Latent Maximum Entropy Principle Shaojun Wang Dale Schuurmans Fuchun Peng Yunxin Zhao University of Toronto University of Waterloo University of Waterloo University of Missouri Toronto, Canada Waterloo, Canada Waterloo, Canada Columbia, USA Abstract configurations of the hidden variables (Ackley et al. 1985, Welling and Teh 2003). We present a new statistical learning In this paper we will focus on the key problem of esti paradigm for Boltzmann machines based mating the parameters of a Boltzmann machine from on a new inference principle we have pro data. There is a surprisingly simple algorithm (Ack posed: the latent maximum entropy principle ley et al. 1985) for performing maximum likelihood (LME). LME is different both from Jaynes' estimation of the weights of a Boltzmann machine-or maximum entropy principle and from stan equivalently, to minimize the KL divergence between dard maximum likelihood estimation. We the empirical data distribution and the implied Boltz demonstrate the LME principle by deriving mann distribution. This classical algorithm is based on new algorithms for Boltzmann machine pa a direct gradient ascent approach where one calculates rameter estimation, and show how a robust (or estimates) the derivative of the likelihood function and rapidly convergent new variant of the with respect to the model parameters. The simplic EM algorithm can be developed. Our exper ity and locality of this gradient ascent algorithm has iments show that estimation based on LME attracted a great deal of interest. generally yields better results than maximum However, there are two significant problems with the likelihood estimation when inferring models standard Boltzmann machine training algorithm that from small amounts of data. -
Stochastic Classical Molecular Dynamics Coupled To
Brazilian Journal of Physics, vol. 29, no. 1, March, 1999 199 Sto chastic Classical Molecular Dynamics Coupled to Functional Density Theory: Applications to Large Molecular Systems K. C. Mundim Institute of Physics, Federal University of Bahia, Salvador, Ba, Brazil and D. E. Ellis Dept. of Chemistry and Materials Research Center Northwestern University, Evanston IL 60208, USA Received 07 Decemb er, 1998 Ahybrid approach is describ ed, which combines sto chastic classical molecular dynamics and rst principles DensityFunctional theory to mo del the atomic and electronic structure of large molecular and solid-state systems. The sto chastic molecular dynamics using Gener- alized Simulated Annealing GSA is based on the nonextensive statistical mechanics and thermo dynamics. Examples are given of applications in linear-chain p olymers, structural ceramics, impurities in metals, and pharmacological molecule-protein interactions. scrib e each of the comp onent pro cedures. I Intro duction In complex materials and biophysics problems the num- b er of degrees of freedom in nuclear and electronic co or- dinates is currently to o large for e ective treatmentby purely rst principles computation. Alternative tech- niques whichinvolvea hybrid mix of classical and quan- tum metho dologies can provide a p owerful to ol for anal- ysis of structure, b onding, mechanical, electrical, and sp ectroscopic prop erties. In this rep ort we describ e an implementation which has b een evolved to deal sp ecif- ically with protein folding, pharmacological molecule do cking, impurities, defects, interfaces, grain b ound- aries in crystals and related problems of 'real' solids. As in anyevolving scheme, there is still muchroom for improvement; however the guiding principles are simple: to obtain a lo cal, chemically intuitive descrip- tion of complex systems, which can b e extended in a systematic way to the nanometer size scale. -
Dynamic Behavior of Constained Back-Propagation Networks
642 Chauvin Dynamic Behavior of Constrained Back-Propagation Networks Yves Chauvin! Thomson-CSF, Inc. 630 Hansen Way, Suite 250 Palo Alto, CA. 94304 ABSTRACT The learning dynamics of the back-propagation algorithm are in vestigated when complexity constraints are added to the standard Least Mean Square (LMS) cost function. It is shown that loss of generalization performance due to overtraining can be avoided when using such complexity constraints. Furthermore, "energy," hidden representations and weight distributions are observed and compared during learning. An attempt is made at explaining the results in terms of linear and non-linear effects in relation to the gradient descent learning algorithm. 1 INTRODUCTION It is generally admitted that generalization performance of back-propagation net works (Rumelhart, Hinton & Williams, 1986) will depend on the relative size ofthe training data and of the trained network. By analogy to curve-fitting and for theo retical considerations, the generalization performance of the network should de crease as the size of the network and the associated number of degrees of freedom increase (Rumelhart, 1987; Denker et al., 1987; Hanson & Pratt, 1989). This paper examines the dynamics of the standard back-propagation algorithm (BP) and of a constrained back-propagation variation (CBP), designed to adapt the size of the network to the training data base. The performance, learning dynamics and the representations resulting from the two algorithms are compared. 1. Also in the Psychology Department, Stanford University, Stanford, CA. 94305 Dynamic Behavior of Constrained Back-Propagation Networks 643 2 GENERALIZATION PERFORM:ANCE 2.1 STANDARD BACK-PROPAGATION In Chauvin (In Press). the generalization performance of a back-propagation net work was observed for a classification task from spectrograms into phonemic cate gories (single speaker. -
Learning and Evaluating Boltzmann Machines
DepartmentofComputerScience 6King’sCollegeRd,Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 Copyright c Ruslan Salakhutdinov 2008. June 26, 2008 UTML TR 2008–002 Learning and Evaluating Boltzmann Machines Ruslan Salakhutdinov Department of Computer Science, University of Toronto Abstract We provide a brief overview of the variational framework for obtaining determinis- tic approximations or upper bounds for the log-partition function. We also review some of the Monte Carlo based methods for estimating partition functions of arbi- trary Markov Random Fields. We then develop an annealed importance sampling (AIS) procedure for estimating partition functions of restricted Boltzmann machines (RBM’s), semi-restricted Boltzmann machines (SRBM’s), and Boltzmann machines (BM’s). Our empirical results indicate that the AIS procedure provides much better estimates of the partition function than some of the popular variational-based meth- ods. Finally, we develop a new learning algorithm for training general Boltzmann machines and show that it can be successfully applied to learning good generative models. Learning and Evaluating Boltzmann Machines Ruslan Salakhutdinov Department of Computer Science, University of Toronto 1 Introduction Undirected graphical models, also known as Markov random fields (MRF’s), or general Boltzmann ma- chines, provide a powerful tool for representing dependency structure between random variables. They have successfully been used in various application domains, including machine learning, computer vi- sion, and statistical physics. The major limitation of undirected graphical models is the need to compute the partition function, whose role is to normalize the joint probability distribution over the set of random variables. -
ARCHITECTS of INTELLIGENCE for Xiaoxiao, Elaine, Colin, and Tristan ARCHITECTS of INTELLIGENCE
MARTIN FORD ARCHITECTS OF INTELLIGENCE For Xiaoxiao, Elaine, Colin, and Tristan ARCHITECTS OF INTELLIGENCE THE TRUTH ABOUT AI FROM THE PEOPLE BUILDING IT MARTIN FORD ARCHITECTS OF INTELLIGENCE Copyright © 2018 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. Acquisition Editors: Ben Renow-Clarke Project Editor: Radhika Atitkar Content Development Editor: Alex Sorrentino Proofreader: Safis Editing Presentation Designer: Sandip Tadge Cover Designer: Clare Bowyer Production Editor: Amit Ramadas Marketing Manager: Rajveer Samra Editorial Director: Dominic Shakeshaft First published: November 2018 Production reference: 2201118 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK ISBN 978-1-78913-151-2 www.packt.com Contents Introduction ........................................................................ 1 A Brief Introduction to the Vocabulary of Artificial Intelligence .......10 How AI Systems Learn ........................................................11 Yoshua Bengio .....................................................................17 Stuart J. -
Cogsci Little Red Book.Indd
OUTSTANDING QUESTIONS IN COGNITIVE SCIENCE A SYMPOSIUM HONORING TEN YEARS OF THE DAVID E. RUMELHART PRIZE IN COGNITIVE SCIENCE August 14, 2010 Portland Oregon, USA OUTSTANDING QUESTIONS IN COGNITIVE SCIENCE A SYMPOSIUM HONORING TEN YEARS OF THE DAVID E. RUMELHART PRIZE IN COGNITIVE SCIENCE August 14, 2010 Portland Oregon, USA David E. Rumelhart David E. Rumelhart ABOUT THE RUMELHART PRIZE AND THE 10 YEAR SYMPOSIUM At the August 2000 meeting of the Cognitive Science Society, Drs. James L. McClelland and Robert J. Glushko presented the initial plan to honor the intellectual contributions of David E. Rumelhart to Cognitive Science. Rumelhart had retired from Stanford University in 1998, suffering from Pick’s disease, a degenerative neurological illness. The plan involved honoring Rumelhart through an annual prize of $100,000, funded by the Robert J. Glushko and Pamela Samuelson Foundation. McClelland was a close collaborator of Rumelhart, and, together, they had written numerous articles and books on parallel distributed processing. Glushko, who had been Rumelhart’s Ph.D student in the late 1970s and a Silicon Valley entrepreneur in the 1990s, is currently an adjunct professor at the University of California – Berkeley. The David E. Rumelhart prize was conceived to honor outstanding research in formal approaches to human cognition. Rumelhart’s own seminal contributions to cognitive science included both connectionist and symbolic models, employing both computational and mathematical tools. These contributions progressed from his early work on analogies and story grammars to the development of backpropagation and the use of parallel distributed processing to model various cognitive abilities. Critically, Rumelhart believed that future progress in cognitive science would depend upon researchers being able to develop rigorous, formal theories of mental structures and processes. -
Introspection:Accelerating Neural Network Training by Learning Weight Evolution
Published as a conference paper at ICLR 2017 INTROSPECTION:ACCELERATING NEURAL NETWORK TRAINING BY LEARNING WEIGHT EVOLUTION Abhishek Sinha∗ Mausoom Sarkar Department of Electronics and Electrical Comm. Engg. Adobe Systems Inc, Noida IIT Kharagpur Uttar Pradesh,India West Bengal, India msarkar at adobe dot com abhishek.sinha94 at gmail dot com Aahitagni Mukherjee∗ Balaji Krishnamurthy Department of Computer Science Adobe Systems Inc, Noida IIT Kanpur Uttar Pradesh,India Uttar Pradesh, India kbalaji at adobe dot com ahitagnimukherjeeam at gmail dot com ABSTRACT Neural Networks are function approximators that have achieved state-of-the-art accuracy in numerous machine learning tasks. In spite of their great success in terms of accuracy, their large training time makes it difficult to use them for various tasks. In this paper, we explore the idea of learning weight evolution pattern from a simple network for accelerating training of novel neural networks. We use a neural network to learn the training pattern from MNIST classifi- cation and utilize it to accelerate training of neural networks used for CIFAR-10 and ImageNet classification. Our method has a low memory footprint and is computationally efficient. This method can also be used with other optimizers to give faster convergence. The results indicate a general trend in the weight evolution during training of neural networks. 1 INTRODUCTION Deep neural networks have been very successful in modeling high-level abstractions in data. How- ever, training a deep neural network for any AI task is a time-consuming process. This is because a arXiv:1704.04959v1 [cs.LG] 17 Apr 2017 large number of parameters need to be learnt using training examples. -
Learning Algorithms for the Classification Restricted Boltzmann
JournalofMachineLearningResearch13(2012)643-669 Submitted 6/11; Revised 11/11; Published 3/12 Learning Algorithms for the Classification Restricted Boltzmann Machine Hugo Larochelle [email protected] Universite´ de Sherbrooke 2500, boul. de l’Universite´ Sherbrooke, Quebec,´ Canada, J1K 2R1 Michael Mandel [email protected] Razvan Pascanu [email protected] Yoshua Bengio [email protected] Departement´ d’informatique et de recherche operationnelle´ Universite´ de Montreal´ 2920, chemin de la Tour Montreal,´ Quebec,´ Canada, H3T 1J8 Editor: Daniel Lee Abstract Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep arti- ficial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classi- fication performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. -
An Efficient Learning Procedure for Deep Boltzmann Machines
ARTICLE Communicated by Yoshua Bengio An Efficient Learning Procedure for Deep Boltzmann Machines Ruslan Salakhutdinov [email protected] Department of Statistics, University of Toronto, Toronto, Ontario M5S 3G3, Canada Geoffrey Hinton [email protected] Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G3, Canada We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a sin- gle mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistic that enter into the gradient of the log likelihood makes it practical to learn Boltzmann machines with multiple hidden lay- ers and millions of parameters. The learning can be made more efficient by using a layer-by-layer pretraining phase that initializes the weights sensibly. The pretraining also allows the variational inference to be ini- tialized sensibly with a single bottom-up pass. We present results on the MNIST and NORB data sets showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects. We also show that the features discovered by deep Boltzmann machines are a very effective way to initialize the hidden layers of feedforward neural nets, which are then discriminatively fine-tuned. 1 A Brief History of Boltzmann Machine Learning The original learning procedure for Boltzmann machines (see section 2) makes use of the fact that the gradient of the log likelihood with respect to a connection weight has a very simple form: it is the difference of two pair-wise statistics (Hinton & Sejnowski, 1983). -
Machine Learning Algorithms Based on Generalized Gibbs Ensembles
Machine learning algorithms based on generalized Gibbs ensembles Tatjana Puˇskarov∗and Axel Cort´esCuberoy Institute for Theoretical Physics, Center for Extreme Matter and Emergent Phenomena, Utrecht University, Princetonplein 5, 3584 CC Utrecht, the Netherlands Abstract Machine learning algorithms often take inspiration from the established results and knowledge from statistical physics. A prototypical example is the Boltzmann machine algorithm for supervised learn- ing, which utilizes knowledge of classical thermal partition functions and the Boltzmann distribution. Recently, a quantum version of the Boltzmann machine was introduced by Amin, et. al., however, non- commutativity of quantum operators renders the training process by minimizing a cost function inefficient. Recent advances in the study of non-equilibrium quantum integrable systems, which never thermalize, have lead to the exploration of a wider class of statistical ensembles. These systems may be described by the so-called generalized Gibbs ensemble (GGE), which incorporates a number of “effective temperatures". We propose that these GGEs can be successfully applied as the basis of a Boltzmann-machine{like learn- ing algorithm, which operates by learning the optimal values of effective temperatures. We show that the GGE algorithm is an optimal quantum Boltzmann machine: it is the only quantum machine that circumvents the quantum training-process problem. We apply a simplified version of the GGE algorithm, where quantum effects are suppressed, to the classification of handwritten digits in the MNIST database. While lower error rates can be found with other state-of-the-art algorithms, we find that our algorithm reaches relatively low error rates while learning a much smaller number of parameters than would be needed in a traditional Boltzmann machine, thereby reducing computational cost.