Fundamentals of Learning Algorithms in Boltzmann Machines

by Mihaela G. Erbiceanu

M. Eng., "Gheorghe Asachi" Technical University, 1991

Project Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Computing Science

in the School of Computing Science Faculty of Applied Sciences

© Mihaela G. Erbiceanu 2016
SIMON FRASER UNIVERSITY
Fall 2016

Approval

Name: Mihaela Erbiceanu

Degree: Master of Science (Computing Science)

Title: Fundamentals of Learning Algorithms in Boltzmann Machines

Examining Committee:
Chair: Binay Bhattacharya, Professor
Petra Berenbrink, Senior Supervisor, Professor
Andrei Bulatov, Supervisor, Professor
Leonid Chindelevitch, External Examiner, Assistant Professor

Date Defended/Approved: September 7, 2016


Abstract

Boltzmann learning underlies an artificial neural network model known as the Boltzmann machine, which extends and improves upon the Hopfield network model. The Boltzmann machine model uses binary units and allows for the existence of hidden units to represent latent variables. When noise is gradually reduced via simulated annealing and uphill steps are allowed via the Metropolis algorithm, the training algorithm increases the chances that, at thermal equilibrium, the network settles on the best distribution of parameters. The existence of an equilibrium distribution for an asynchronous Boltzmann machine is also analyzed. Two families of learning algorithms, which correspond to two different approaches to computing the statistics required for learning, are presented. The learning algorithms based only on stochastic approximations are traditionally slow. When variational approximations of the free energy are used, like the mean field approximation or the Bethe approximation, the performance of learning improves considerably. The principal contribution of the present study is to provide, from a rigorous mathematical perspective, a unified framework for these two families of learning algorithms in asynchronous Boltzmann machines.

Keywords: Boltzmann–Gibbs distribution, asynchronous Boltzmann machine, thermal equilibrium, data–dependent statistics, data–independent statistics, stochastic approximation, variational method, mean field approximation, Bethe approximation.


Dedication

This thesis is dedicated to my mother for her support, sacrifice, and constant love.


Acknowledgements

First and foremost, I would like to thank my supervisor Petra Berenbrink not only for giving me the opportunity to work on this thesis under her supervision, but also for her valuable feedback. I would also like to thank Andrei Bulatov, my second supervisor, and my committee members, Leonid Chindelevitch and Binay Bhattacharya, for their support, encouragement, and patience.


Table of Contents

Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Acronyms

Chapter 1. Introduction
1.1 Motivation
1.2 Overview and roadmap
1.3 Related work
1.4 Connection to other disciplines

Chapter 2. Foundations
2.1 Boltzmann–Gibbs distribution
2.2 Markov random fields and Gibbs measures
2.3 Gibbs free energy
2.4 Connectionist networks
2.5 Hopfield networks
2.5.1 Hopfield network models
2.5.2 Convergence of the Hopfield network

Chapter 3. Variational methods for Markov networks
3.1 Pairwise Markov networks as exponential families
3.1.1 Basics of exponential families
3.1.2 Canonical representation of pairwise Markov networks
3.1.3 Mean parameterization of pairwise Markov networks
3.1.4 The role of transformations between parameterizations
3.2 The energy functional
3.3 Gibbs free energy revisited
3.3.1 Hamiltonian and Plefka expansion
3.3.2 The Gibbs free energy as a variational energy
3.4 Mean field approximation
3.4.1 The mean field energy functional
3.4.2 Maximizing the energy functional: fixed–point characterization
3.4.3 Maximizing the energy functional: the naïve mean field algorithm
3.5 Bethe approximation
3.5.1 The Bethe free energy
3.5.2 The Bethe–Gibbs free energy
3.5.3 The relationship between belief propagation fixed–points and Bethe free energy
3.5.4 Belief optimization

Chapter 4. Introduction to Boltzmann Machines
4.1 Definitions
4.2 Modelling the underlying structure of an environment
4.3 Representation of a Boltzmann Machine as an energy–based model
4.4 How a Boltzmann Machine models data
4.5 General dynamics of Boltzmann Machines
4.6 The biological interpretation of the model

Chapter 5. The Mathematical Theory of Learning Algorithms for Boltzmann Machines
5.1 Problem description
5.2 Phases of a learning algorithm in a Boltzmann Machine
5.3 Learning algorithms based on approximate maximum likelihood
5.3.1 Learning by minimizing the KL–divergence of Gibbs measures
5.3.2 Collecting the statistics required for learning
5.4 The equilibrium distribution of a Boltzmann machine
5.5 Learning algorithms based on variational approaches
5.5.1 Using variational free energies to compute the statistics required for learning
5.5.2 Learning by naïve mean field approximation
5.6 Unlearning and relearning in Boltzmann Machines

Chapter 6. Conclusions
6.1 Summary of what has been done
6.2 Future directions

References

Appendix A: Mathematical notations

Appendix B: Probability theory and statistics

Appendix C: Finite Markov chains


List of Tables

Table 1: Distributions of interest in asynchronous Boltzmann machines
Table 2: Transition probability matrices for asynchronous symmetric Boltzmann machines


List of Figures

Figure 1: a) A fully–connected Boltzmann machine with three visible nodes and four hidden nodes; b) A layered Boltzmann machine with one visible layer and two hidden layers


List of Acronyms

SFU Simon Fraser University

LAC Library and Archives Canada

BO belief optimization

BP belief propagation

CD contrastive divergence

KL–divergence Kullback–Leibler divergence

LBP loopy belief propagation

MCMC Markov chain Monte Carlo

ML maximum likelihood


Chapter 1. Introduction

1.1 Motivation

Boltzmann machines are a particular class of artificial neural networks that have been extensively studied because of the interesting properties of the associated learning algorithms. In this context, learning for Boltzmann machines means “acquiring a particular behavior by observing it” [1]. The machine is named after Ludwig Boltzmann, who discovered the fundamental law governing the equilibrium state of a gas. The distribution of molecules of an ideal gas among the various energy states is called the Boltzmann–Gibbs distribution. This distribution was proposed by Geoffrey Hinton and Terrence Sejnowski as the stochastic update rule in a new network which they named the “Boltzmann Machine”.

From a purely theoretical point of view, a Boltzmann machine is a generalization of a Hopfield network in which the units update their states according to a stochastic decision rule and which allows the presence of hidden units.

From a graphical–model point of view, a Boltzmann machine is a binary pairwise Markov random field in which every node is endowed with a non–linear activation function similar to an activation model for neurons. As a graphical model, a Boltzmann machine has both a structural component, encoded by the pattern of edges in the underlying graph, and a parametric component, encoded by the potentials associated with sets of edges in the underlying graph. The particularities of the Boltzmann machine model that make it suitable for pattern recognition tasks are due more to its parameterization than to its conditional independence structure.

The main interest in Boltzmann machines has come from the neural network field, where a particular type of Boltzmann machine – the layered Boltzmann machine – is considered a deep neural network. The learning algorithms for Boltzmann machines have mostly been created to train this kind of neural network.

Boltzmann machines are theoretically intriguing because of the locality and Hebbian1 nature of their training algorithm, and because of their parallelism and the resemblance of their dynamics to simple physical processes [2]. There is, however, one drawback in the use of the learning process in Boltzmann machines: the process is computationally very expensive.

1 Hebbian learning is a theory in neuroscience that proposes an explanation for the adaptation of neurons in the brain during the learning process. See Section 2.4 for more information.


The computational complexity of the exact learning algorithm is exponential in the number of neurons because it involves the computation of the partition function of the Boltzmann–Gibbs distribution, which requires a sum over all states of the network, of which there are exponentially many. If a learning algorithm uses an approximate inference method to compute the partition function of the Boltzmann–Gibbs distribution, then the learning process can be made efficient enough to be useful for practical problems.

Based on the approach employed to compute the statistics required for learning, the Boltzmann machine learning algorithms are divided into two groups or families. One family of algorithms uses only stochastic approximations; the other family uses both variational approximations and stochastic approximations to compute the statistics.

The goal of this paper is to present, from a rigorous mathematical perspective, a unified framework for two families of learning algorithms in asynchronous Boltzmann machines. Precursors of our approach are: Sussmann, who elaborated in 1988–1989 a rigorous mathematical analysis of the original asynchronous Boltzmann machine learning algorithm [1,3]; Welling and Teh, who, in 2002, reviewed the inference algorithms in Boltzmann machines with emphasis on the advanced mean field methods and loopy belief propagation [4]; and Salakhutdinov, who, in 2008, in the context of arbitrary Markov random fields, reviewed some Monte Carlo based methods for estimating partition functions, as well as the variational framework for obtaining deterministic approximations or upper bounds for the log partition function [5]. During the 1990s and early 2000s the subject of learning algorithms for Boltzmann machines was somewhat neglected by the research community. However, a promising new method to train deep neural networks, proposed in 2006 by Hinton et al. [6], caused a resurgence of interest in this subject. Despite its ups and downs as a research subject, a considerable number of papers on learning algorithms for Boltzmann machines have been published. Some of these papers proposed refinements of existing algorithms; others proposed new algorithms or even completely new approaches.

However, as far as we know, there has not yet been a documented effort to gather in one place, with a consistent set of definitions and notations, and built on a unified framework of concepts, proofs, and interpretations of results, the mathematical foundations of the main families of Boltzmann machine learning algorithms. By approaching the topic of this paper from a theoretical computer science perspective, but without omitting the intuition behind it, we intend to fill this void and to help other interested parties obtain a good understanding of the intricacies and limitations of Boltzmann machines and their learning algorithms.


1.2 Overview and roadmap

This paper consists of six chapters and three appendices and is organized as follows.

In Chapter 1, we present an introduction to the topic of this paper and our goals in covering it. We also include a brief history of Boltzmann machine learning and what connections it has with other disciplines.

In Chapter 2, we introduce the Boltzmann–Gibbs distribution as the main source of inspiration for the Boltzmann machine. We also review the main concepts and results from Markov random field theory that are subsequently used in this paper. Then we introduce the Gibbs free energy and its intrinsic relationship with the Boltzmann–Gibbs distribution. Furthermore, we introduce the precursors of the Boltzmann machine: the connectionist networks and the Hopfield networks. Because the asynchronous Hopfield network represents the limiting case of the asynchronous Boltzmann machine as the “temperature” parameter T → 0, we cover the dynamics and convergence of Hopfield networks as well as their learning algorithms.

In Chapter 3, we start by introducing the basics of variational methodology. We also explain how, in certain conditions, the Gibbs free energy can be viewed as a variational energy. Then we review the main concepts and results regarding two classes of variational methods that are used by Boltzmann machine learning algorithms to approximate the free energy of a Markov random field: the mean field approximation and the Bethe approximation.

In Chapter 4, we introduce the asynchronous Boltzmann machine. Starting from formal definitions and covering how the underlying environment is modeled, the energy–based representation, the data representation, and the general dynamics, we provide a detailed description of its functionality, without omitting the intuition behind its concepts and algorithms. We end this chapter with the biological interpretation of the model as it was given by Hinton.

In Chapter 5 we start by formally defining the process of learning and justifying why the Boltzmann machine learning algorithms have two phases. Then we present two categories of learning algorithms for the asynchronous Boltzmann machine: those based on Monte Carlo methods and those based on variational approximations of the free energy, specifically the mean field approximation and the Bethe approximation. For each category we present the derivation and analysis of the original algorithm. Other important algorithms from each category are introduced by presenting their differences and/or improvements relative to the original algorithm. Finally, we cover the processes of unlearning and relearning in asynchronous Boltzmann machines.


Finally, Chapter 6 contains a very brief summary and outlook.

In Appendix A we introduce the mathematical notations used throughout this paper.

In Appendix B we review the main concepts from probability theory and statistics that are necessary to have a good understanding of the paper.

In Appendix C we review the main concepts regarding finite Markov chains that are necessary to have a good understanding of this paper.

1.3 Related work

In 1982, Hopfield showed that a network of symmetrically–coupled binary threshold units has a simple quadratic energy function that governs its dynamic behavior [7]. When the nodes are deterministically updated one at a time, the network settles to an energy minimum and Hopfield suggested using these minima to store content–addressable memories.

Hinton and Sejnowski realized that the energy function can be viewed as an indirect way of defining a probability distribution over all the binary configurations of the network and that, if the right stochastic updating rule is used, the dynamics eventually produces samples from the Boltzmann–Gibbs distribution [8-9]. This discovery led them to invent in 1983 the “Boltzmann Machine” [8-9]. Furthermore, if a Boltzmann machine is divided into a set of visible nodes whose states are externally forced or “clamped” at the data and a disjoint set of hidden nodes, the stochastic updating produces samples from the posterior distribution over configurations of the hidden nodes given the current data [8,10-11]. Ackley, Hinton, and Sejnowski proposed a learning algorithm that performs maximum likelihood learning of the weights that define the hidden nodes and uses sequential Gibbs sampling to approach the posterior distribution [10]. This new algorithm is known in the literature as the original learning algorithm/procedure for (asynchronous) Boltzmann machines. Inspired by Kirkpatrick, Gelatt, and Vecchi [12], Hinton and Sejnowski used simulated annealing from a high initial “temperature” to a final “temperature” of 1 to speed up convergence to the stationary distribution. They demonstrated that this was a feasible way of learning the weights in small networks. However, the original learning procedure was still much too slow to be practical for learning large, multilayer Boltzmann machines [13]. The simplicity and locality of the original learning procedure for Boltzmann machines led to much interest, but the settling time required to get samples from the right distribution and the high noise in the estimates made learning slow and unreliable [5].


During the following two decades researchers tried to improve the learning speed of the Boltzmann machine by using various approaches.

In 1992 Neal improved the original learning procedure by using persistent Markov chains [14]. Neal did not explicitly use simulated annealing. However, the persistent Markov chains implement it implicitly, provided that the weights have small initial values. Neal showed that persistent Markov chains work quite well for training a Boltzmann machine on a fairly small data set [14]. For large data sets, however, it is much more efficient to update the weights after a small mini batch of training examples [13].

The first efficient learning procedure for large–scale asynchronous Boltzmann machines used an extremely limited architecture named Restricted Boltzmann Machine. This architecture together with its learning procedure was first proposed by Smolensky in 1986 [15] and it was designed to make inference tractable [13].

In 1987, in an attempt to reduce the time required by the sampling process, Peterson and Anderson [16-17] replaced Gibbs sampling with a simple mean field method that approximates a stationary distribution by replacing stochastic binary values with deterministic real–valued probabilities. More sophisticated deterministic approximation methods were investigated by Galland in 1990 [18-19], Kappen and Rodriguez in 1998 [20-21], and Tanaka in 1998 [22-23], but none of these approximations worked very well for learning, for reasons that were not well understood at the time [13]. Similar deterministic approximation methods were studied intensively in the 1990s in the context of learning directed graphical models [24-27]. In 2010 Salakhutdinov interpreted these results and provided a possible explanation of the limited success of using deterministic approximation methods for learning in asynchronous Boltzmann machines [13].

Because variational methods typically scale well to large applications, during the 2000s extensive research was done on obtaining deterministic approximations [28-29] or deterministic upper bounds [30-32] on the log partition function of an arbitrary discrete Markov random field. The Bethe approximation first made its appearance in the field of approximate inference and error correcting decoding in [33-34] under the names TAP approximation and cavity method. The relation between belief propagation and the Bethe approximation was further clarified in [28-29,35-36], where it was shown that belief propagation, even when applied to loopy graphs, has fixed–points at the stationary points of the Bethe free energy. In 2003 Welling and Teh proposed a new algorithm, named belief optimization, to minimize the Bethe free energy directly, as an alternative to the fixed–point equations of belief propagation [4].


In 2002 Hinton proposed a new learning algorithm for the asynchronous Boltzmann machine: contrastive divergence learning. In his view this new algorithm works as a better approximation to the maximum likelihood learning used in the original learning algorithm [37]. The most attractive aspect of this new algorithm is that it allows Restricted Boltzmann Machines with millions of parameters to achieve state–of–the–art performance on a large collaborative filtering task [38].

The newest variant of the asynchronous Boltzmann machine, called the Deep Boltzmann Machine, is a deep multilayer Boltzmann machine that was proposed in 2009 by Salakhutdinov and Hinton [39]. Its learning algorithm was designed to incorporate both bottom–up and top–down feedback, allowing a better propagation of uncertainty about ambiguous inputs [39].

The dynamics of the synchronous Boltzmann machine were first studied in the 1970s by Little and Shaw [40-41]. A comprehensive study of synchronous Boltzmann machines and their learning algorithms was done by Viveros in her PhD thesis in 2001 [42].

1.4 Connection to other disciplines

We previously mentioned that the learning algorithms for Boltzmann machines have been intensively used for training deep neural networks. A deep neural network is an artificial neural network with multiple hidden layers of units between the input and output layers which is capable of learning the underlying constraints that characterize a domain simply by being shown examples from the domain [10-11,43]. Deep learning is closely related to a class of theories of brain development, known as neocortical development, that was proposed by cognitive neuroscientists in the early 1990s. Neocortical development is a major focus in neurobiology, not only from a purely developmental standpoint, but also because understanding how the neocortex develops provides important insight into mature neocortical organization and function. This shows how Boltzmann machines and their learning algorithms are connected with neurobiology and, generally, with the field of cognitive sciences.

When applied to Boltzmann machines and their learning algorithms, the emphasis on mathematical technique and rigor employed by theoretical computing science becomes an invaluable research asset.


Chapter 2. Foundations

2.1 Boltzmann–Gibbs distribution

In statistical mechanics and mathematics, the Boltzmann–Gibbs distribution (also called the Boltzmann distribution or the Gibbs distribution) is a certain distribution function or probability measure for the distribution of the states of a system. The Boltzmann distribution is named after Ludwig Boltzmann, who first formulated it in 1868 during his studies of the statistical mechanics of gases in thermal equilibrium [44]. The distribution was later investigated extensively, in its modern generic form, by Josiah Willard Gibbs (1902) [45]. It underpins the concept of the canonical ensemble by providing its underlying distribution. In more general mathematical settings, the Boltzmann–Gibbs distribution is also known as the Gibbs measure.

In statistical mechanics the Boltzmann–Gibbs distribution is an intrinsic characteristic of isolated (or nearly–isolated) systems of fixed composition that are in thermal equilibrium (i.e., equilibrium with respect to the energy exchange). The most general case of such a system is the canonical ensemble. Before we define the concept of a canonical ensemble, we need to define a concept that is employed in its definition, specifically the heat bath.

Definition 2.1:

In thermodynamics, a heat bath is a system 퐵 which is in contact with a many–particle system 퐴 such that:

 A and B can exchange energy, but not particles;
 B is at equilibrium and has temperature T;
 B is much larger than A, so that its contact with A does not affect its equilibrium state.

Definition 2.2:

In statistical mechanics, a canonical ensemble is the statistical ensemble that represents the possible states of a mechanical system in thermal equilibrium with a heat bath at some fixed temperature.


From the previous definitions we can infer that the states of the system A, which plays the role of a canonical ensemble, will differ in their total energy as a consequence of the energy exchange with the system B, which plays the role of a heat bath. The principal thermodynamic variable of the canonical ensemble, determining the probability distribution of states, is the absolute temperature T. In general, the canonical ensemble X assigns to each distinct microstate x a probability P(X = x) (equivalently, the probability of the random variable X having the value x) given by the following exponential:

P(X = x) = \exp\left(\frac{F - E(x)}{k \cdot T}\right)    (2.1)

where: E(x) is the energy of the microstate x; F is the Helmholtz free energy; T is the absolute temperature of the system; and k is Boltzmann's constant. E(x) is a function that maps the space of states to ℝ and is interpreted as the energy of state x. For a given ensemble the Helmholtz free energy is a constant.

If the canonical ensemble has 푚 states accessible to the system of interest indexed by {1,2, … 푚}, then the equation (2.1) can be rewritten as:

P_i = \frac{\exp(-E_i / (k \cdot T))}{\sum_{j=1}^{m} \exp(-E_j / (k \cdot T))}    (2.2)

where: P_i is the probability of state i; E_i is the energy of state i; k is Boltzmann's constant; T is the absolute temperature of the system; and m is the number of states of the canonical ensemble.

An alternative but equivalent formulation for the canonical ensemble uses the canonical partition function or normalization constant Z rather than the free energy and is described below:

Z(T) = \exp\left(-\frac{F}{k \cdot T}\right)    (2.3)

For a canonical ensemble with 푚 states, if we know the energy of the states accessible to the system of interest, we can calculate the canonical partition function 푍 as follows:

Z(T) = \sum_{j=1}^{m} \exp\left(-\frac{E_j}{k \cdot T}\right)    (2.4)

By introducing the canonical partition function defined by equation (2.3), respectively (2.4), into equation (2.1), respectively (2.2), we obtain:


P(X = x) = \frac{1}{Z(T)} \cdot \exp\left(-\frac{E(x)}{k \cdot T}\right)    (2.5)

and, respectively:

P_i = \frac{1}{Z(T)} \cdot \exp\left(-\frac{E_i}{k \cdot T}\right)    (2.6)

In a system with local (finite–range) interactions, the canonical ensemble's distribution maximizes the entropy density for a given expected energy density or, equivalently, minimizes the free energy density. The distribution shows that states with lower energy will always have a higher probability of being occupied than states with higher energy.

The Boltzmann–Gibbs distribution is often used to describe the distribution of particles, such as atoms in binary alloys or molecules in a gas, over the energy states accessible to them. If we have a system consisting of a finite number of particles, the probability of a particle being in state i is practically the probability that, if we pick a random particle from that system and check what state it is in, we will find that it is in state i. This probability is equal to the number of particles in state i divided by the total number of particles in the system, which is the fraction of particles that occupy state i. Formula (2.7) gives the fraction of particles in state i as a function of the state's energy:

P_i = \frac{n_i}{n} = \frac{\exp(-E_i / (k \cdot T))}{\sum_{j=1}^{m} \exp(-E_j / (k \cdot T))}    (2.7)

where n is the total number of particles in the system and n_i is the number of particles in state i.
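To make equations (2.4)–(2.7) concrete, the following minimal Python sketch, assuming a hypothetical three–state system and k = 1, computes the partition function and the resulting Boltzmann–Gibbs probabilities; lower–energy states receive higher probability, and the distribution flattens as T grows.

import math

def boltzmann_probabilities(energies, T, k=1.0):
    """Return the Boltzmann-Gibbs probabilities P_i for a finite set of state energies.

    Implements equations (2.4) and (2.6): Z(T) = sum_j exp(-E_j / (k*T)),
    P_i = exp(-E_i / (k*T)) / Z(T).
    """
    weights = [math.exp(-E / (k * T)) for E in energies]  # unnormalized Boltzmann factors
    Z = sum(weights)                                      # canonical partition function, eq. (2.4)
    return [w / Z for w in weights]

# Illustrative (made-up) energies of a three-state system at two temperatures.
energies = [0.0, 1.0, 2.0]
for T in (0.5, 5.0):
    probs = boltzmann_probabilities(energies, T)
    print(T, [round(p, 4) for p in probs])
# At low T the lowest-energy state dominates; at high T the distribution flattens.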

In infinite systems, the total energy is no longer a finite number and cannot be used in the traditional construction of the probability distribution of a canonical ensemble. The traditional approach, followed by statistical physicists, of studying the thermodynamic limit of the energy function as the size of a finite system approaches infinity, had not been very useful. Looking for an alternative approach, the researchers discovered that, when the energy function of an infinite system can be written as a sum of terms that each involves only variables from a finite subsystem, the notion of Gibbs measure provides a framework to directly study such systems (instead of taking the limit of finite systems).


Definition 2.3:

In physics, a probability measure is a Gibbs measure if the conditional probabilities it induces on each finite subsystem satisfy the following consistency condition: if all degrees of freedom outside the finite subsystem are frozen, the canonical ensemble for the subsystem subject to these boundary conditions matches the probabilities in the Gibbs measure conditional on the frozen degrees of freedom.

2.2 Markov random fields and Gibbs measures

We have seen that the Gibbs measure has a native relationship with physics: it was born to describe the behavior of a system whose interactions between particles can be described by a form of energy. Moreover, the Gibbs measure can be applied successfully to systems outside its domain of origin, sometimes even without introducing notions specific to physics into the probabilistic definitions of those systems. Examples of such systems are: Hopfield networks, Markov random fields, and Markov logic networks. All these systems exploit the following general principle derived from Boltzmann's and Gibbs's work: a network consisting of a large number of units, with each unit interacting with neighbouring units, will approach at equilibrium a canonical distribution given by equations (2.5) and (2.6). This expanded applicability of the Gibbs measure has been made possible by a fundamental mathematical result known as the Hammersley–Clifford theorem or the fundamental theorem of random fields. In this section we present how computer scientists adapted the physicists' definition of the Gibbs measure for graphical models. We also present the Hammersley–Clifford theorem and its consequences with respect to the special class of Markov random fields that is the Boltzmann machine.

Dobrushin showed in [46] that, apparently, there are two different ways to define configurations of points on a structure that mathematically resemble a lattice; he called these configurations “random fields”. One way is based on the formulation of statistical mechanics of Gibbs and is generally accepted as the simplest useful mathematical model of a discrete gas (also called lattice gas) [46]. The other way, introduced by Dobrushin himself, is that of Markov random fields. Dobrushin’s formulation has no apparent connection with physics, being instead based on the natural way of extending the notion of a Markov process [46].

A Markov process is a stochastic model that has the Markov property, i.e., the conditional probability distribution of future states of the process (conditional on both past and present


states) depends only upon the present state, not on the sequence of events that preceded it. A special case of Markov process is the Markov chain.

A Markov chain is a discrete–time Markov process with a countable or finite state space.

A Markov random field, also called Markov network, extends the Markov chain to two or more dimensions or to random variables defined for an interconnected network of items; therefore, it may be considered a generalization of a Markov chain in multiple dimensions.

In this paper we use the term Markov random field to designate a Markov random field that models an interconnected network of items. In a Markov chain, each state depends only on the previous state in time, whereas in a Markov random field each state depends only on its neighbors in any of multiple directions. Hence, a Markov random field may be visualized as a field or graph of random variables, where the distribution of each random variable depends on the neighboring variables which it is connected with. Thus, in a Markov random field the Markov property becomes a local property rather than a temporal property.

Any graphical model can be seen as a “marriage” between probability theory and graph theory. A consequence of this relationship is the existence of two equivalent characterizations of the family of probability distributions associated with an undirected graph: one algebraic, which involves the concept of factorization, and one graph–theoretic, which involves the concept of reachability [27,47]. For Markov random fields, the concept of reachability is identified with conditional independence and the concept of factorization with the factor graph representation. The Hammersley–Clifford theorem shows that these two ways of defining a random field are equivalent, which further translates into an equivalence between Markov random fields and Gibbs measures. Before we present the Hammersley–Clifford theorem, we formally introduce the concepts it operates with: Markov random field and Gibbs measure. We use the notations for univariate and multivariate random variables specified in Appendix A.

Definition 2.4:

Given an undirected graph 퐺 = (푉, 퐸), a set of random variables X = (푋푣)푣∈푉 indexed by 푉 form a Markov random field with respect to 퐺 if they satisfy the Markov property expressed in either one of the following forms:

 Pairwise Markov Property: Any two non–adjacent variables are conditionally independent given all other variables:


X_u \perp X_v \mid X_{V \setminus \{u,v\}} \quad \text{if } \{u,v\} \notin E    (2.8)

 Local Markov Property: A variable is conditionally independent of all other variables given its neighbors:

X_v \perp X_{V \setminus \mathrm{cl}(v)} \mid X_{\mathrm{ne}(v)}    (2.9)

where ne(푣), also called the Markov blanket, is the set of neighbors of 푣 and cl(푣) = ne(푣) ∪ {푣} is the closed neighborhood of 푣.

 Global Markov Property: Any two subsets of variables are conditionally independent given a separating subset:

X_A \perp X_B \mid X_S    (2.10)

where every path from a node in 퐴 to a node in 퐵 passes through 푆.

Generally, these three expressions of the Markov property are not equivalent. The local Markov property is stronger than the pairwise one, but weaker than the global one.
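As a small illustration of the local Markov property, the following Python sketch, assuming a hypothetical five–node graph stored as an adjacency dictionary, computes ne(v) and cl(v); conditioned on X_{ne(v)}, the variable X_v is independent of all remaining variables.

# Hypothetical undirected graph given as an adjacency dictionary.
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5},
    5: {4},
}

def neighbors(graph, v):
    """ne(v): the Markov blanket of node v in a Markov random field."""
    return set(graph[v])

def closed_neighborhood(graph, v):
    """cl(v) = ne(v) union {v}."""
    return neighbors(graph, v) | {v}

v = 3
ne_v = neighbors(graph, v)
rest = set(graph) - closed_neighborhood(graph, v)
# Local Markov property: X_3 is conditionally independent of X_rest given X_ne(3).
print(ne_v, rest)   # {1, 2, 4} and {5}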

Definitions 2.5:

A probability distribution 퐏(X) = 퐏(푋1, 푋2, … , 푋푛) on an undirected graph 퐺 = (푉, 퐸) with |푉| = 푛 is called a Gibbs distribution or Gibbs measure if it can be factorized into potentials defined on cliques that cover all the nodes and edges of 퐺.

A potential function or sufficient statistic is a function defined on the set of configurations of a clique (i.e., a setting of values for all the nodes in the clique) that associates a positive real number with each configuration. Hence, for every subset of nodes Xc ⊆ 푉 that form a clique, we associate a non–negative potential 휙푐 = 휙푐(Xc).

In this paper we will refer equivalently to the nodes of 퐺 that form the clique Xc and the random variables that correspond to those nodes. Before formulating the Gibbs measure let us introduce the following notations:

 C_G = \{X_{c_1}, X_{c_2}, \ldots, X_{c_d}\} = \{X_{c_j} : 1 \le j \le d, d \le n\} represents a set of d cliques that cover the edges and nodes of the underlying graph G;

 \Phi_G = \{\phi_{c_1}, \phi_{c_2}, \ldots, \phi_{c_d}\} = \{\phi_{c_j} : 1 \le j \le d, d = |C_G|\} represents the set of potential functions or clique potentials that correspond to C_G;

 There is a one–to–one correspondence between C_G and \Phi_G, i.e., \phi_{c_j} = \phi_{c_j}(X_{c_j}). Therefore, it should be generally understood that, when iterating over C_G, we also iterate over \Phi_G.

The Gibbs measure is precisely the joint probability distribution of all the nodes in the graph 퐺 = (푉, 퐸) and is obtained by taking the product over the clique potentials:

P(X) = \frac{1}{Z} \cdot \prod_{X_c \in C_G} \phi_c(X_c)    (2.11)

where:

Z \equiv Z(P) = \sum_{X} \prod_{X_c \in C_G} \phi_c(X_c)    (2.12)

A \equiv A(P) = \log(Z(P)) \equiv \ln(Z(P)) = \ln(Z)    (2.13)

where Z, called the partition function, is a constant chosen to ensure that the distribution P is normalized. If the distribution P belongs to the exponential family, it is more practical to work with the logarithm, specifically the natural logarithm, of the partition function Z. By definition, the cumulant function A is the natural logarithm of Z.
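The following Python sketch, assuming a hypothetical three–node binary network with two edge cliques, evaluates equations (2.11)–(2.13) by brute force: the unnormalized measure is the product of the clique potentials, Z sums that product over all configurations, and A = ln Z.

import itertools, math

# Hypothetical clique potentials on a 3-node binary graph with edges (0,1) and (1,2).
# Each potential maps a configuration of its clique to a positive number.
def phi_01(x0, x1):
    return 2.0 if x0 == x1 else 0.5      # favours agreement on edge (0,1)

def phi_12(x1, x2):
    return 1.5 if x1 == x2 else 1.0      # weakly favours agreement on edge (1,2)

def unnormalized(x):
    """Product of clique potentials for configuration x = (x0, x1, x2)."""
    return phi_01(x[0], x[1]) * phi_12(x[1], x[2])

configs = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in configs)          # partition function, eq. (2.12)
A = math.log(Z)                                    # cumulant function, eq. (2.13)
P = {x: unnormalized(x) / Z for x in configs}      # Gibbs measure, eq. (2.11)

print(round(Z, 4), round(A, 4))
print(round(sum(P.values()), 4))                   # sanity check: probabilities sum to 1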

The set CG is often taken to be the set of all maximal cliques of the graph 퐺, i.e., the set of cliques that are not properly contained within any other clique. This condition can be imposed without loss of generality because any representation based on non–maximal cliques can always be converted into one based on maximal cliques by redefining the potential function on a maximal clique to be the product over the potential functions on the subsets of that clique.

However, the factorization of a Markov random field is of particular value when CG consists of more than the maximal cliques. This is the case of factor graphs.

Definition 2.6:

Given a factorization of a function:

g : \mathbb{R}^n \to \mathbb{R}, \quad g(X_1, X_2, \ldots, X_n) = \prod_{j=1}^{m} f_j(S_j)    (2.14)

where: S_j \subseteq \{X_1, X_2, \ldots, X_n\}


The corresponding factor graph 퐺 = (푋, 퐹, 퐸) is a bipartite graph that consists of: variable nodes

푋 = {푋1, 푋2, … , 푋푛}, factor nodes 퐹 = {푓1, … 푓푚}, and edges 퐸. The edges depend on the factorization as follows: there is an undirected edge between factor 푓푗 and variable 푋푖 if and only if 푋푖 is an argument of 푓푗, i.e., 푋푖 ∈ 푆푗.

Factor graphs allow a finer–grained specification of factorization properties by explicitly representing potential functions for non–maximal cliques. We observe that a factor graph has only node potentials and pairwise potentials. Generally, if the potential functions in a Markov random field are defined over single variables or pairs of variables, then the Markov random field is referred to as a pairwise Markov network. More precisely, a pairwise Markov network over a graph G = (V, E) is a Markov random field associated with a set of node potentials and a set of edge potentials as described by equation (2.15):

\Phi_G = \{\phi(X_i) : X_i \in V, 1 \le i \le n\} \cup \{\phi(X_i, X_j) : \{i,j\} \in E, 1 \le i, j \le n\}    (2.15)

A factor graph is a pairwise Markov network whose nodes and edges are endowed with special meanings that originate in the function it factorizes. We will come back to the relationship between Markov random fields and factor graphs in Section 3.5.

One important property of Markov random fields is that the potential functions ΦG need not have any obvious or direct relation to marginal or conditional distributions defined over the graph cliques.

Theorem 2.1 (Hammersley–Clifford):

A probability distribution that has a positive mass or density satisfies the Markov property with respect to an undirected graph if and only if it is a Gibbs distribution in that graph.

The proof of this theorem is outside the scope of this paper. A rigorous mathematical proof can be found in [48]. The Hammersley–Clifford theorem gives the necessary and sufficient conditions under which a Gibbs measure is equivalent to a Markov random field. Consequently, any positive probability measure that satisfies a Markov property is a Gibbs measure for an appropriate choice of (locally defined) energy function.


The learning algorithms in a pairwise Markov network like the Boltzmann machine require computing statistical quantities (e.g., likelihoods and probabilities) and information–theoretic quantities (e.g., mutual information and conditional entropies) on the underlying graphical model. These types of computational tasks in a graphical model are called inference or probabilistic inference. Furthermore, the learning algorithms are built on inference algorithms and allow parameters and structures to be estimated from data. However, exact inference for large–scale Markov random fields is intractable. Therefore, to achieve a scalable learning algorithm, approximate methods are required.

One popular source of approximate methods is the Markov chain Monte Carlo (MCMC) framework. The main problem with the MCMC approach is that convergence times can be long and it can be difficult to diagnose convergence.

An alternative to MCMC is the variational framework, whose goal is to convert the probabilistic inference problem into an optimization problem. The best known variational algorithm used in Boltzmann machine learning is the mean field approximation, which searches for the best distribution that assumes independence among all the nodes and then uses it to approximate the true posterior distribution over the hidden variables.

Another alternative to MCMC is the belief propagation (BP) framework. BP is a message passing algorithm for performing inference on tree–like graphs. The discovery of the relationship between belief propagation and Bethe free energy led to the so–called Bethe approximation of the free energy, which led to a new class of learning algorithms for Boltzmann machine.

2.3 Gibbs free energy

The third millennium has brought exciting progress in understanding computationally hard problems in computer science by using a variety of concepts and methods from statistical physics. One of these concepts is the Gibbs free energy. In this section we start by briefly introducing the Gibbs free energy as a thermodynamic potential; then we explain how this energy can be adapted to describe a Markov random field. In subsequent development we use the term temperature to designate the absolute temperature of a canonical ensemble or, generally, of a thermodynamic system, and the term pseudo–temperature to designate the “temperature” of a Markov random field, i.e., a parameter which models the thermal noise


injected into the system. The majority of theoretical results reviewed in this section come from [49] and [4].

The Gibbs free energy, originally called the “available energy”, was developed in the 1870s by Josiah Willard Gibbs, who described it in [50] as:

the greatest amount of mechanical work which can be obtained from a given quantity of a certain substance in a given initial state, without increasing its total volume or allowing heat to pass to or from external bodies, except such as at the close of the processes are left in their initial condition.

The Gibbs free energy is one of the four thermodynamic potentials used in the chemical thermodynamics of reactions and non–cyclic processes. The other three thermodynamic potentials are: the internal energy, the enthalpy, and the Helmholtz free energy. In this paper we are only interested in the internal energy, the Helmholtz free energy, and the Gibbs free energy.

Generally, energy is a concept which takes into account the physical nature of a system. The exact (true) energy E is usually unknown, but the mean (internal) energy U is usually known – for example, when it is determined by external factors such as a thermostat.

The internal energy U is a thermodynamic potential that might be thought of as the energy contained within a system, or equivalently the energy required to create the system in the absence of changes in temperature or volume.

If the system is created in an environment of temperature 퐓, then some of the energy can be obtained by spontaneous heat transfer from the environment to the system. The amount of this spontaneous energy transfer is 퐓 ∙ 푆, where 퐓 represents the temperature and 푆 represents the final entropy of the system. The Helmholtz free energy 퐹 is then a measure of the amount of energy required to create a system once the spontaneous energy transfer to the system from the environment is accounted for:

F = U - T \cdot S    (2.16)

where: U is the internal energy; S is the entropy; and T is the temperature of the system. At low temperatures the Helmholtz free energy is dominated by the energy, while at high temperatures the entropy dominates it. The Helmholtz free energy is commonly used for systems held at constant volume. Moreover, for a system at constant temperature and volume, the Helmholtz free energy is minimized at equilibrium.


If the system is created from a very small volume, in order to "create room" for the system, an additional amount of work P ∙ V must be done, where P represents the absolute pressure and V represents the final volume of the system. As discussed in defining the Helmholtz free energy, an environment at constant temperature 퐓 will contribute an amount 퐓 ∙ 푆 to the system, reducing the overall investment necessary for creating the system. The Gibbs free energy 퐺 is then the net energy contribution for a system created in an environment of temperature 퐓 from a negligible initial volume:

퐺 = 푈 − 퐓 ∙ 푆 + P ∙ V (2.17) where: 푈 is the internal energy; 푆 is the entropy; 퐓 is the temperature; P is the absolute pressure; and V is the final volume of the system. For a system at constant pressure and temperature, the Gibbs free energy is minimized at equilibrium.

In the context of Markov random fields, energy is a scalar quantity used to represent the state and the parameters of the system under certain conditions. Similarly to a thermodynamic system, the true energy of a Markov random field is unknown. The true energy of a Markov random field at equilibrium is referred to as the true free energy and corresponds to the true joint probability distribution P(X_1, …, X_n) of the random field. If the true joint probability distribution has a positive mass or density, then, according to Theorem 2.1, it is a Boltzmann–Gibbs distribution.

The internal energy 푈 of a Markov random field 퐏(푋1, … , 푋푛) is defined as the expected value of the exact energy 퐸 of the system.

U_P = \mathbf{E}_P[E(X_1, \ldots, X_n)] = \sum_{X_1, \ldots, X_n} P(X_1, \ldots, X_n) \cdot E(X_1, \ldots, X_n)    (2.18)

The entropy 푆 of a Markov random field 퐏(푋1, … , 푋푛) is defined as the expected value of the logarithm of the inverse of the probability distribution 퐏 of the system (equations (B26) and (B27) from Appendix B):

S_P = \mathbf{E}_P\left[\ln \frac{1}{P(X_1, \ldots, X_n)}\right] = -\sum_{X_1, \ldots, X_n} P(X_1, \ldots, X_n) \cdot \ln(P(X_1, \ldots, X_n))    (2.19)

The exact Gibbs free energy can be thought of as a mathematical construction designed so that its minimization leads to the Boltzmann–Gibbs distribution given by the equation (2.5) [49]. In order to define the exact Gibbs free energy, we write the equation (2.5) for a Markov random field as:


P(X_1, \ldots, X_n) = \frac{1}{Z} \cdot \exp\left(-\frac{E(X_1, \ldots, X_n)}{T}\right)    (2.20)

where E(X_1, …, X_n) is the true energy of the Markov random field (adjusted with Boltzmann's constant) and T is the pseudo–temperature of the Markov random field.

By definition, the exact Gibbs free energy denoted 퐺exact is the following function of the true joint probability function 퐏(푋1, … , 푋푛):

G_{exact}[P(X_1, \ldots, X_n)] = U_P - T \cdot S_P    (2.21)

where: U_P is given by equation (2.18); S_P is given by equation (2.19); and T is the pseudo–temperature of the system.

We note the absence from the equation (2.21) of the term P ∙ V of the equation (2.17). This absence is explained by the fact that the parameters pressure and volume of a thermodynamic system do not have any correspondent in a Markov random field. Hence, the exact Gibbs free energy of a Markov random field is apparently identical with the Helmholtz free energy. However, there is a difference of nuance between them: while the Helmholtz free energy is just the value 푈퐏 − 퐓 ∙ 푆퐏 computed at equilibrium, the Gibbs free energy is a function that computes the expression 푈퐏 − 퐓 ∙ 푆퐏 for any state of the network after applying some constraints [49].

At equilibrium, the exact Gibbs free energy is equal to the Helmholtz free energy, which is given by the formula:

퐹 = −퐓 ∙ ln(푍) (2.22) where 푍 is the partition function of the Markov random field [49].

It can be shown that, if we minimize 퐺푒푥푎푐푡 given by (2.21) with respect to 퐏(푋1, … , 푋푛) and enforce, via a Lagrange multiplier, the constraint of 퐏 being a probability distribution, then we recover, as desired, the Boltzmann–Gibbs distribution.
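The following derivation sketches why this is so; it assumes only the normalization constraint on P (enforced with a Lagrange multiplier λ) and treats the pseudo–temperature T as a constant. Writing x for a configuration (X_1, …, X_n), the Lagrangian of the constrained minimization of (2.21) is

\mathcal{L}[P] = \sum_{x} P(x) \, E(x) + T \sum_{x} P(x) \ln P(x) + \lambda \left( \sum_{x} P(x) - 1 \right)

where the first term is U_P and the second is -T \cdot S_P. Setting the derivative with respect to P(x) to zero gives

\frac{\partial \mathcal{L}}{\partial P(x)} = E(x) + T \left( \ln P(x) + 1 \right) + \lambda = 0 \quad \Longrightarrow \quad P(x) = \exp\left( -1 - \frac{\lambda}{T} \right) \cdot \exp\left( -\frac{E(x)}{T} \right)

and the normalization constraint fixes the constant prefactor to 1/Z, which is exactly the Boltzmann–Gibbs distribution (2.20).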

Different types of constraints can be imposed on various probabilities that characterize the Markov random field, and each such scenario “produces” a Gibbs free energy. By minimizing a Gibbs free energy with respect to the probabilities that are constrained, we obtain self–consistent equations that must be obeyed at equilibrium [49].

In general, a given system can have more than “one Gibbs free energy” depending on what constraints are applied and over what probabilities. If the full joint probability is constrained, then we obtain the exact Gibbs free energy denoted 퐺푒푥푎푐푡. If some or all marginal probabilities are


constrained, then we obtain an approximate Gibbs free energy denoted G. The mean field free energy and the Bethe free energy, which we are going to introduce in Chapter 3, are both Gibbs free energies. The advantage of working in a Markov random field with a Gibbs free energy instead of a Boltzmann–Gibbs distribution is that it is much easier to come up with ideas for approximations [49].

2.4 Connectionist networks

In order to emphasize their brain–style computational properties, Hinton has characterized Boltzmann machines as connectionist networks, specifically symmetrical connectionist networks with hidden units. Their counterparts without hidden units are the Hopfield networks. Before diving into the world of Boltzmann machines, we briefly present their “ancestors”: the connectionist networks and the Hopfield networks.

Connectionism is a set of approaches in the field of cognitive science that models mental or behavioral phenomena as the emergent processes of interconnected networks of simple units. The central connectionist principle is that mental phenomena can be described in terms of brain–style computation rather than rule–based symbol manipulation. However, the connectionist architectures are not meant to duplicate the physiology of the human brain, but rather to receive inspiration from known facts about how the brain works [51]. There are many forms of connectionism, but the most common form uses artificial neural network models.

Connectionist models typically consist of many simple neuron–like processing elements called units that interact by using weighted connections. The connections between units can be symmetrical or asymmetrical, depending on whether they have the same weight in both directions or not. Each unit has a state or activity level that is determined by the input received from other units in the network. There are many possible conventions for the values of these states within this general framework: 1/0, +1/−1, on/off. When the effective values of the states of the units are not important for the argument we try to make, we refer to them as on/off. One common, simplifying assumption is that the combined effects of the rest of the network on the i-th unit are mediated by a single scalar quantity. This quantity, which is called the total input of unit i and denoted net_i, is a linear function of the activity levels of the units that provide input to unit i:


net_i = \sum_j \sigma_j \cdot w_{ji} - \theta_i    (2.23)

where: \sigma_j is the state of the j-th unit; w_{ji} is the weight on the connection from the j-th to the i-th unit; and \theta_i is the threshold of the i-th unit.

An external input vector can be supplied to the network by clamping the states of some units or by adding an input term to the total input of some units. By taking into consideration the external input, the total input of unit i is computed with the formula:

net_i = \sum_j \sigma_j \cdot w_{ji} + I_i - \theta_i    (2.24)

where: \sigma_j is the state of the j-th unit; w_{ji} is the weight on the connection from the j-th to the i-th unit; \theta_i is the threshold of the i-th unit; and I_i is the external input received by the i-th unit.

The threshold term can be eliminated by giving every unit an extra input connection whose activity level is always on. The weight on this special connection is the negative of the threshold, and it can be learned in just the same way as the other weights.

The capacity of a network to change over time is expressed at unit level by the concept of activation. At any time, a unit in the network has an activation, which is a numerical value intended to represent some aspect of the unit, which is often called the state of the unit. The activation of a unit spreads to all the other units connected to it. Typically the state of a unit is described as a function, called the activation function, of the total input that it receives from its input units. Usually the activation function is non–linear, but it can be linear as well.

For units with discrete nonnegative states, the activation function typically has value 0 or 1.

For units with continuous nonnegative states a typical activation function is the logistic sigmoid defined by the formula (2.25).

For units with discrete bipolar states, the typical activation function has value -1 or 1.

For units with continuous positive and negative states a typical activation function is the hyperbolic tangent defined by the formula (2.26).

States 0/1: \sigma_i = \mathrm{sigm}(net_i) = \frac{1}{1 + \exp(-net_i)}    (2.25)

States -1/1: \sigma_i = \tanh(net_i) = \frac{\exp(net_i) - \exp(-net_i)}{\exp(net_i) + \exp(-net_i)} = \frac{\exp(2 \cdot net_i) - 1}{\exp(2 \cdot net_i) + 1}    (2.26)


where net_i is the total input of the i-th unit and \sigma_i is the state of the same unit.
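A minimal Python sketch of equations (2.23)–(2.26), assuming made-up weights and states for a single unit; it also illustrates the trick, described below, of absorbing the threshold into an always-on unit whose connection weight is the negative of the threshold.

import math

def net_input(states, weights_to_i, threshold=0.0, external_input=0.0):
    """Total input of unit i, eq. (2.23)/(2.24): net_i = sum_j sigma_j * w_ji + I_i - theta_i."""
    return sum(s * w for s, w in zip(states, weights_to_i)) + external_input - threshold

def logistic(net):
    """Activation for 0/1 units, eq. (2.25)."""
    return 1.0 / (1.0 + math.exp(-net))

def tanh_activation(net):
    """Activation for -1/+1 units, eq. (2.26)."""
    return math.tanh(net)

# Hypothetical incoming weights and states for a single unit i.
states = [1, 0, 1]
weights_to_i = [0.5, -0.3, 0.8]
theta_i = 0.2

net = net_input(states, weights_to_i, threshold=theta_i)

# Bias trick: prepend an always-on unit whose weight is -theta_i; the explicit threshold disappears.
net_bias_trick = net_input([1] + states, [-theta_i] + weights_to_i)

print(round(net, 3), round(net_bias_trick, 3), round(logistic(net), 3), round(tanh_activation(net), 3))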

All the long–term knowledge in a connectionist model is encoded by the locations and the weights of the connections, so learning consists of changing the weights or adding or removing connections. The short–term knowledge of the model is normally encoded by the states of the units, but some models also have fast–changing temporary weights or thresholds that can be used to encode temporary contexts or bindings [51].

2.5 Hopfield networks

Historically the Boltzmann machine was preceded by a simpler connectionist model invented by John Hopfield in 1982. Hopfield's network is not only a precursor of the Boltzmann machine; it also represents the limiting case of the asynchronous Boltzmann machine as the pseudo–temperature parameter T → 0. The network proposed by Hopfield in [7] and expanded in [52] is a symmetrical connectionist network without hidden units whose main purpose is to store memories as distributed patterns of activity. Hopfield, who is a physicist, got the idea of a network acting as an associative memory by studying the dynamics of a physical system whose state space is dominated by a substantial number of locally stable states to which the system is attracted. He regarded these numerous locally stable states as an associative memory or content addressable memory. Before we present the Hopfield network, we briefly introduce the ideas of associative memory and Hebbian learning, which are used by the learning algorithms of both the Hopfield network and the Boltzmann machine.

Inspired by the associative nature of biological memory, Hebb proposed in 1949 a simple model for the neuron that captures the idea of associative memory [2]. Hebb's theory is often summarized by Siegrid Löwel's phrase: "neurons wire together if they fire together" [53]. We are going to present the intuition behind Hebb's theory by using an example. Let us imagine that the weights between neurons whose activities are positively correlated are increased:

\frac{d}{dt} w_{ij} = \mathrm{corr}(\sigma_i, \sigma_j)    (2.27)

where corr(\sigma_i, \sigma_j) is the correlation coefficient between the states \sigma_i and \sigma_j.

Let us also imagine the following two scenarios:

 when stimulus 푖 is present – for instance, a bell ringing – the activity of neuron 푖 increases;


 neuron 푗 is associated with another stimulus 푗 – for instance, the sight of a teacher coming to the classroom carrying a register.

If these two stimuli – first a person formally dressed and carrying a register, and second a ringing bell – co–occur in the environment, then the Hebbian learning rule will increase the weights w_{ij} and w_{ji}. This means that when, on a later occasion, stimulus j occurs in isolation, making the activity of \sigma_j large, the positive weight from j to i will cause neuron i to be activated as well. Thus, the response to the sight of a formally dressed person carrying a register is an automatic association with the bell ringing sound. Hence, we would expect to hear a bell ringing. We could call this "pattern completion". No instructor is required for this associative memory to work and no signal is needed to indicate that a correlation has been detected or that an association should be made. Thus, the unsupervised local learning algorithm and the unsupervised local activity rule spontaneously produce the associative memory.
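A minimal Python sketch of a discrete-time version of rule (2.27), assuming hypothetical bipolar activity patterns and approximating the correlation by the product of the two states; weights between co-active units grow, which produces the pattern-completion behaviour described above.

def hebbian_update(weights, states, learning_rate=0.1):
    """Discrete-time version of eq. (2.27): dw_ij/dt ~ corr(sigma_i, sigma_j).

    Here the correlation is approximated by the product sigma_i * sigma_j of
    bipolar (-1/+1) states, so units that are active together strengthen their connection.
    """
    n = len(states)
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] += learning_rate * states[i] * states[j]
    return weights

n = 3
weights = [[0.0] * n for _ in range(n)]

# Hypothetical co-occurring stimuli: units 0 ("bell") and 1 ("teacher with register") fire together.
patterns = [[+1, +1, -1], [+1, +1, -1], [-1, -1, +1]]
for p in patterns:
    weights = hebbian_update(weights, p)

print(round(weights[0][1], 2), round(weights[1][0], 2))   # positive: the two stimuli are now associated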

2.5.1 Hopfield network models

In his influential paper [7] Hopfield proposed a model that was later called the binary Hopfield network. Later he generalized the original model and published in [52] a new model called the continuous Hopfield network. In [52] Hopfield also explained the relationship between the two models. Because a binary Hopfield network becomes a Boltzmann machine with the addition of noise in updating, we give a detailed presentation of the binary model and only a brief presentation of the continuous model. We also briefly present the relation between the stable states of the Hopfield models.

2.5.1.1 The binary Hopfield model

 Architecture:

A binary Hopfield network consists of 푁 processing devices called neurons or units. Each unit 푖 has two activation levels or states: off or not firing, usually represented as 휎푖 = 0, and on or firing at maximum rate, usually represented as 휎푖 = 1. An alternative representation of the off/on states uses the bipolar elements -1 and +1.


In the Hopfield network there are weights associated with the connections between units. All these weights are organized in a matrix W = (w_{ij})_{1 \le i, j \le n} called the weight matrix or the correlation matrix. The strength of the connection between two units i and j is called the weight and is denoted w_{ij}. The units are connected through symmetric, bidirectional connections, so w_{ij} = w_{ji}. If two units i and j are not connected, then w_{ij} = 0. If they are connected, then w_{ij} > 0 or w_{ij} < 0. There are no self–connections, so w_{ii} = 0 for all i ∈ {1, 2, …, n}.

The activity of unit i, denoted net_i, represents the total input that the unit receives from other units and is computed either with the equation (2.23) or with the equation (2.24), depending on the presence of the external input. Unless otherwise stated, we consider the external input I_i for each unit i to be 0. The units are binary threshold units, i.e., for each unit i there is a fixed threshold θ_i ≥ 0. We can think of the threshold of unit i as the weight of a special connection from a virtual unit “0”, whose activity is permanently on, towards unit i. We formally express this relation as θ_i = −w_{i0}. Then, if we include the threshold in the computation of the activity of the unit, the equation (2.23) becomes:

\mathrm{net}_i = \sum_{j=1}^{n} w_{ji}\,\sigma_j - \theta_i = \sum_{j=0}^{n} w_{ji}\,\sigma_j \quad \text{for } 1 \le i \le n    (2.28)

The instantaneous state of a model composed of 푛 units is specified by a configuration or state vector 𝝈 whose elements represent the states of the units: 𝝈 = (휎1, 휎2, … , 휎푛).

 Global energy:

Hopfield realized that, when the weight matrix W is symmetric, the network can be characterized by a global energy function [7]. Moreover, each configuration of the network can be assigned an energy. The global energy of the network is a sum of contributions from each unit and is computed with the following formula:

E = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\,\sigma_j\,\sigma_i + \sum_{i=1}^{n} \sigma_i\,\theta_i    (2.29)

This simple quadratic energy function makes it possible for each unit to compute locally how its state affects the global energy.
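For concreteness, here is a minimal sketch (an illustrative Python/NumPy example, not from the original text) of how the global energy (2.29) can be evaluated for a 0/1 configuration; the names global_energy, W, theta, and sigma are assumptions of the sketch.

import numpy as np

def global_energy(W, theta, sigma):
    """Global energy of a binary Hopfield configuration, equation (2.29).

    W: symmetric (n, n) weight matrix with zero diagonal; theta: (n,) thresholds;
    sigma: (n,) vector of 0/1 unit states.
    """
    interaction = -0.5 * sigma @ W @ sigma    # -1/2 * sum_ij w_ij * sigma_i * sigma_j
    threshold = sigma @ theta                 # + sum_i sigma_i * theta_i
    return interaction + threshold

Because the energy is quadratic, the effect of flipping a single unit can be computed locally from net_i alone, which is exactly what the update rule below exploits.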

 Update rule:


The state of the model system changes in time as a consequence of each unit i readjusting its state. While the selection of the unit to be updated could be a stochastic process (taking place at a mean rate r > 0 for each unit) or a deterministic process (being part of a predefined sequence), the update itself is always a deterministic process. Each selected unit evaluates whether its activity is above or below zero (because we included the threshold in the computation of the unit’s activity) and updates its state according to the following “threshold rules”:

States 0/1:
\sigma_i \to 0 \ \text{ if } \ \mathrm{net}_i = \sum_{j=1}^{n} w_{ji}\,\sigma_j - \theta_i \le 0
\sigma_i \to 1 \ \text{ if } \ \mathrm{net}_i = \sum_{j=1}^{n} w_{ji}\,\sigma_j - \theta_i > 0    (2.30)

States -1/1:
\sigma_i \to -1 \ \text{ if } \ \mathrm{net}_i = \sum_{j=1}^{n} w_{ji}\,\sigma_j - \theta_i \le 0
\sigma_i \to 1 \ \text{ if } \ \mathrm{net}_i = \sum_{j=1}^{n} w_{ji}\,\sigma_j - \theta_i > 0    (2.31)

Equivalently, the update rule can be formulated as: update each unit to whichever of its two states gives the lowest global energy. The updates may be synchronous or asynchronous and, because the network has feedback (i.e., every unit’s output is an input to other units), an order for the updates to occur has to be specified.

 Synchronous (parallel) updates: first all units compute their activities (net_i)_{1≤i≤n}, and then they update their states (σ_i)_{1≤i≤n} simultaneously.

There are a few drawbacks to this update strategy. First, if the units make their decisions simultaneously, the energy could go up. Second, with simultaneous parallel updating we can get oscillations, which always have a period of two. However, if the updates occur in parallel but with random timing, the oscillations are usually destroyed.

 Asynchronous (sequential) updates: one unit at a time computes its activity net_i and updates its state σ_i.

When units are randomly chosen to update, the global energy E of the network will either decrease or stay the same. Under repeated sequential updating the network will eventually converge to a state which is a local minimum of the global energy function.


Thus, if a state is a local minimum in the global energy function, it is a stable state for the network.

 Learning rule:

The first goal of the Hopfield network is to store the input data or desired memories – this is what we call the store phase. The desired memories are represented as a set with m elements, each element being an n–dimensional binary vector.

The second goal is that, given the initial configuration of a Hopfield network, the network is capable of retrieving or recalling one particular configuration or stored memory from all the memories stored in the network – this is what we call the recall phase.

In general, the initial configuration is a noisy version of one stored memory. The learning rule is intended to make a set of desired memories become stored memories, i.e., stable states of the Hopfield network's activity rule. In order to understand how the Hopfield network learns a set of desired memories, we first present the information storage rules and then prove that the stored memories are stable states for the Hopfield network.

 Information storage algorithm:

We start by observing that each desired memory represents a possible configuration of the network:

\boldsymbol{\sigma}^{(s)} = \big(\sigma_1^{(s)}, \sigma_2^{(s)}, \ldots, \sigma_n^{(s)}\big) \quad \text{for all } s \in \{1, 2, \ldots, m\}    (2.32)

Hopfield demonstrated that the capacity m of a totally connected network with n units under his storage rule is only about 0.15n memories [7]. Also, if all the desired memories are known, the matrix W does not change in time. Hence, it can be determined in advance.

Hopfield proposed the following rule for computing the weights of a network whose purpose is to store a given set of m desired memories. In both cases the factor 1/m ensures that |w_ij| ≤ 1.

States 0/1:
w_{ij} = \frac{1}{m} \sum_{s=1}^{m} \big(2\sigma_i^{(s)} - 1\big)\big(2\sigma_j^{(s)} - 1\big) \quad \text{for } 1 \le i, j \le n; \ 1 \le s \le m, \qquad w_{ii} = 0 \ \text{ for } 1 \le i \le n    (2.33)

States -1/1:
w_{ij} = \frac{1}{m} \sum_{s=1}^{m} \sigma_i^{(s)}\,\sigma_j^{(s)} \quad \text{for } 1 \le i, j \le n; \ 1 \le s \le m, \qquad w_{ii} = 0 \ \text{ for } 1 \le i \le n    (2.34)


There is another way to compute the weight matrix W. The algorithm starts from a matrix W with all elements equal to zero. For each binary vector σ that represents a desired memory, the weight w_ij between any two units i and j is incremented by a quantity Δw_ij:

w_{ij} \leftarrow w_{ij} + \Delta w_{ij} \quad \text{for } 1 \le i, j \le n    (2.35)

where Δw_ij is computed with the following formulae:

States 0/1: \Delta w_{ij} = 4 \left(\sigma_i - \tfrac{1}{2}\right)\left(\sigma_j - \tfrac{1}{2}\right) \quad \text{for } 1 \le i, j \le n
States -1/1: \Delta w_{ij} = \sigma_i\,\sigma_j \quad \text{for } 1 \le i, j \le n    (2.36)

The rules (2.35) and (2.36) are applied to the whole matrix m times, once for each desired memory. After these steps each weight w_ij has an integer value in the range [−m, m]. Finally, the weight matrix W may be normalized by multiplying it by the factor 1/m.
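The incremental storage rule (2.35)–(2.36) translates directly into code. The sketch below is illustrative (the names store_memories and memories are hypothetical); it accumulates the outer–product increments for 0/1 memories, zeroes the diagonal, and applies the optional 1/m normalization.

import numpy as np

def store_memories(memories):
    """Hopfield weight matrix from 0/1 desired memories via the rules (2.35)-(2.36).

    memories: (m, n) array, each row a desired memory. Returns the (n, n) matrix W,
    normalized by 1/m so that |w_ij| <= 1.
    """
    m, n = memories.shape
    W = np.zeros((n, n))
    for sigma in memories:
        bipolar = 2.0 * sigma - 1.0        # (2*sigma - 1); note 4*(s_i - 1/2)*(s_j - 1/2) = (2s_i - 1)(2s_j - 1)
        W += np.outer(bipolar, bipolar)    # increment of rule (2.36)
    np.fill_diagonal(W, 0.0)               # no self-connections
    return W / m                           # optional 1/m normalization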

Once the matrix W has been computed, the desired memories have become stored memories. Now we need to prove that the stored memories are stable states of the Hopfield network. We present the proof only for the case when the states of the units are represented as 0/1. The proof for the case when the states are represented as -1/1 is similar.

In order to prove that the stored memories (σ^{(s)})_{1≤s≤m} are stable states for the Hopfield network, we start by computing net_i^{(s)}, the activity of some unit i in the s-th stored memory:

\mathrm{net}_i^{(s)} = \sum_{j=0}^{n} w_{ji}\,\sigma_j^{(s)} = \sum_{j=0}^{n} \sigma_j^{(s)} \sum_{u=1}^{m} \big(2\sigma_i^{(u)} - 1\big)\big(2\sigma_j^{(u)} - 1\big)
\mathrm{net}_i^{(s)} = \sum_{u=1}^{m} \big(2\sigma_i^{(u)} - 1\big) \left[ \sum_{j=0}^{n} \sigma_j^{(s)} \big(2\sigma_j^{(u)} - 1\big) \right]    (2.37)

In the equation (2.37) the mean value of the bracketed term is 0 unless s ≡ u, in which case the mean value is n/2. This pseudo–orthogonality yields:

\mathrm{net}_i^{(s)} = \sum_{j=0}^{n} w_{ji}\,\sigma_j^{(s)} \cong \big\langle \mathrm{net}_i^{(s)} \big\rangle = \frac{n}{2}\,\big(2\sigma_i^{(s)} - 1\big) \quad \text{for } 1 \le i \le n    (2.38)


The equation (2.38) shows that ⟨net_i^{(s)}⟩ is positive when σ_i^{(s)} = 1 and negative when σ_i^{(s)} = 0. The s-th stored state would therefore always be stable under Hopfield’s algorithm, except for the noise coming from the s ≠ u terms.

2.5.1.2 The continuous Hopfield model and its relation with the binary Hopfield model

Let us consider a binary Hopfield network where the set of possible states σ_i of unit i is {V_i^0, V_i^1}, where V_i^0 ∈ ℝ, V_i^1 ∈ ℝ, V_i^0 < V_i^1, and 1 ≤ i ≤ n.

Let us also consider another Hopfield network identical to the first one except for the following aspects:

 the output variable V_i of unit i is a continuous and monotone increasing function of the instantaneous input net_i of the same unit;
 the output variable V_i of unit i has the range V_i^0 ≤ V_i ≤ V_i^1;
 the input–output relation is described by a sigmoid function with horizontal asymptotes V_i^0 and V_i^1.

In the second network the sigmoid function acts as an activation function and the output Vi of unit 푖 is similar to the state 휎푖 of unit 푖 in the first network:

휎푖 ≡ Vi for 1 ≤ 푖 ≤ 푛 (2.39)

If V_i^0 and V_i^1 are 0 and 1 respectively, then an appropriate activation function for the second network is the logistic sigmoid and the activity of unit i is computed with the formula (2.40). If V_i^0 and V_i^1 are -1 and 1 respectively, then an appropriate activation function is the hyperbolic tangent and the activity of unit i is computed with the formula (2.41).

\{V_i^0, V_i^1\} = \{0, 1\}: \quad V_i = \mathrm{sigm}(\mathrm{net}_i) = \frac{1}{1 + \exp\!\left(-\sum_{j=1}^{n} w_{ji}\,\sigma_j + \theta_i\right)}    (2.40)

\{V_i^0, V_i^1\} = \{-1, 1\}: \quad V_i = \tanh(\mathrm{net}_i) = \tanh\!\left(\sum_{j=1}^{n} w_{ji}\,\sigma_j - \theta_i\right)    (2.41)

Each unit updates its state as if it were the only unit in the network. The updates may also be synchronous or asynchronous, and the learning rule is similar to the learning rule of the binary network. We observe that the binary Hopfield network is a special case of the continuous Hopfield network. The continuous model has the same flow properties in its continuous space that the binary model does in its discrete space. It can, therefore, be used as a content addressable memory or for any other computational task for which an energy function is essential.

The relation between the stable states of the two models

For a given weight matrix W, the stable states of the continuous system have a simple correspondence with the stable states of the binary system. The discrete algorithm searches for minimal states at the corners of the hypercube, i.e., corners that are lower than adjacent corners. Since the global energy of the model is a linear function of a single unit state along any cube edge, the energy minima (or maxima) for the discrete space with, for instance, activities σ_i ∈ {0, 1} are exactly the same corners as the energy minima (or maxima) for the continuous case with activities 0 ≤ V_i ≤ 1 [52].

2.5.2 Convergence of the Hopfield network

Hopfield claimed that his original model behaves as an associative memory when the state space flow generated by the algorithm is characterized by a set of stable fixed–points such that each stable point represents a nominally assigned memory. He proved that the stored memories are stable under the asynchronous update rule and, moreover, that the asynchronous update rule of a Hopfield network is able to take a partial memory or a corrupted memory and perform pattern completion or error correction to restore the original memory [7,50]. The proof relies on an essential feature of the store–recall operation: the state space flow algorithm converges to stable states. The flow convergence to stable states is guaranteed by a mathematical condition imposed on the weight matrix W: it must be symmetric and have zero diagonal elements. Here we present a sketch of the proof for the case of a binary Hopfield network with asynchronous updates. The proof for the continuous Hopfield network is very similar.

Claim: The binary threshold update rules (2.30) and (2.31) cause the network to settle to a minimum of the global energy function.


Proof: The proof follows the construction of an appropriate energy function 퐸 (equation (2.29)) that is always decreased by any state change produced by the algorithm.

First we introduce the concept of energy gap; then we compute the energy gap of some unit 푖, where 1 ≤ 푖 ≤ 푛. The energy gap of unit 푖 represents the change Δ퐸푖 in global energy 퐸 due to changing the state of the unit 푖 by Δ휎푖 and keeping all the other units unchanged. In order to compute Δ퐸푖, we rewrite the equation (2.29) by separating the contribution of unit 푖 from the contributions of all the other units:

E = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\,\sigma_j\,\sigma_i + \sum_{i=1}^{n} \sigma_i\,\theta_i

E = \left(-\frac{1}{2} \sum_{\substack{k=1 \\ k \ne i}}^{n} \sum_{\substack{j=1 \\ j \ne i}}^{n} w_{kj}\,\sigma_j\,\sigma_k + \sum_{\substack{k=1 \\ k \ne i}}^{n} \sigma_k\,\theta_k\right) + \left(-\sum_{\substack{j=1 \\ j \ne i}}^{n} w_{ij}\,\sigma_j\,\sigma_i + \sigma_i\,\theta_i\right)

E = \left(-\frac{1}{2} \sum_{\substack{k=1 \\ k \ne i}}^{n} \sum_{\substack{j=1 \\ j \ne i}}^{n} w_{kj}\,\sigma_j\,\sigma_k + \sum_{\substack{k=1 \\ k \ne i}}^{n} \sigma_k\,\theta_k\right) + \left(-\sum_{\substack{j=1 \\ j \ne i}}^{n} w_{ij}\,\sigma_j + \theta_i\right)\cdot\sigma_i    (2.42)

In the equation (2.42) the content of the first parenthesis doesn’t depend on the state of unit i. Consequently, the first parenthesis is eliminated during the computation of ΔE_i.

States 0/1:
\Delta E_i = E(\sigma_i = 0) - E(\sigma_i = 1) = -\left(\sum_{\substack{j=1 \\ j \ne i}}^{n} w_{ij}\,\sigma_j - \theta_i\right)\cdot\Delta\sigma_i    (2.43)

States -1/1:
\Delta E_i = E(\sigma_i = -1) - E(\sigma_i = 1) = -\left(\sum_{\substack{j=1 \\ j \ne i}}^{n} w_{ij}\,\sigma_j - \theta_i\right)\cdot\Delta\sigma_i    (2.44)

According to the equation (2.23), the content of the parenthesis in both equations (2.43) and (2.44) is exactly net_i. Hence, the equations (2.43) and (2.44) can be compactly written as:

\Delta E_i = -\mathrm{net}_i \cdot \Delta\sigma_i    (2.45)

According to the update rules (2.30) and (2.31), Δσ_i is positive (the state changes from 0 to 1, respectively from -1 to 1) only when net_i is positive, and is negative (the state changes from 1 to 0, respectively from 1 to -1) only when net_i is negative. Therefore, any change in the global energy E under the algorithm is negative; in other words, the global energy E is a monotonically decreasing function. Moreover, for a given set of weights W and a given set of thresholds (θ_i)_{1≤i≤n} the global energy E is both lower and upper bounded. Hence, the iteration of the algorithm must lead to stable states that do not further change with time.

The following algorithm describes the dynamics of a trained Hopfield network that uses the representation of states as 0/1, starts from a given configuration, and converges to a stable configuration. If the states are represented as -1/1, the step 4 of Algorithm 2.1 needs to be modified correspondingly.

Algorithm 2.1: Hopfield Network Dynamics

Given: a trained network W and an initial configuration σ

begin
Step 1:  repeat
Step 2:    choose a unit i at random with mean rate r > 0
Step 3:    compute the activity of unit i: net_i
Step 4:    if net_i > 0 and σ_i = 0 then set σ_i = 1
           if net_i < 0 and σ_i = 1 then set σ_i = 0
           A unit that changes its state as described above becomes “satisfied”.
           If no state change is necessary, the unit is already satisfied.
Step 5:  until the current configuration is stable
         A configuration is stable when all the units are satisfied.
end
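A minimal Python sketch of these dynamics for the 0/1 representation is given below (illustrative only; the function name recall is hypothetical, and the unit selection is a simple random sweep rather than a Poisson process with mean rate r):

import numpy as np

def recall(W, sigma0, theta=None, max_sweeps=100, seed=0):
    """Asynchronous Hopfield dynamics (as in Algorithm 2.1): update units until stable."""
    rng = np.random.default_rng(seed)
    sigma = sigma0.astype(float).copy()
    n = len(sigma)
    theta = np.zeros(n) if theta is None else theta
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(n):                 # asynchronous, randomly ordered updates
            net_i = W[:, i] @ sigma - theta[i]
            new_state = 1.0 if net_i > 0 else 0.0    # threshold rule (2.30)
            if new_state != sigma[i]:
                sigma[i] = new_state                 # an accepted change never increases E, eq. (2.45)
                changed = True
        if not changed:                              # all units satisfied: stable configuration
            return sigma
    return sigma

Starting from a corrupted version of a stored memory, this loop typically settles into the nearby stored pattern, which is exactly the pattern-completion behaviour discussed above.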

In a Hopfield network the weight matrix W simultaneously contains many memories. We refer to the process of incorporating all these memories into the network’s weights as training. The training process is described by the equations (2.33) and (2.34). A trained Hopfield network converges to a stable configuration that generally depends on the initial configuration of the network. This means that the stored memories or stable points can be individually reconstructed from partial information supplied in an initial state of the network.


If the stable points describe a simple flow in which nearby points in state space tend to remain close during the flow (i.e., a non–mixing flow), then initial states that are close (in Hamming distance) to a particular stable state and far from all others will tend to terminate in that nearby stable state [52]. If the initial state is ambiguous (i.e., not particularly close to any stable state), then the flow is not entirely deterministic and the system responds to that ambiguous state by a statistical choice between the memory states it most resembles [7].

States near a particular stable point contain partial information about the memory assigned to that stable point. From an initial state of partial information about a memory, a final stable state with all the information of the memory is found. The memory is reached not by knowing an address, but rather by supplying in the initial state some subpart of the memory. Any subpart of adequate size will do – the memory is truly addressable by content rather than location.

Because the Hopfield network uses its local energy minima to store memories, when the system is started near some local minimum, the desired behavior of the network is to fall into that local minimum and not to find the global minimum.


Chapter 3. Variational methods for Markov networks

Variational methods are used as approximation methods in a wide variety of settings. They have become very popular, since they typically scale well to large applications. The name variational method refers to a general strategy in which the problem to be solved is expanded to include additional parameters that increase the degrees of freedom over which the optimization is performed and which must be fit to the problem at hand. Each choice of these new parameters, called variational parameters, gives an approximate answer to the original problem. The best approximation is usually obtained by optimizing the variational parameters. In this way the “expansion” of the original problem is actually a way to convert a complex problem into a simpler one, where the simpler problem is generally characterized by a decoupling of the degrees of freedom in the original problem [27]. Throughout this chapter, we use the standard terminology for graphical models and we concentrate on Markov networks.

3.1 Pairwise Markov networks as exponential families

In this section we take a look at the parameterization of a pairwise Markov network, i.e. a representation of it as a parameterized family of probability distributions, which is the same as belonging to an exponential family of probability distributions. Our approach is justified by the fact that the particularities of the Boltzmann machine model are due to its parameterization and not to its conditional independence structure. Therefore, we start by defining the concept of exponential family together with a few related concepts. Then we apply them to obtain an exponential form for a pairwise Markov network. Next we define the concept of canonical parameters and we introduce the canonical representations for pairwise Markov networks. Then we define mean parameters and we introduce the mean parameterization for pairwise Markov networks. We end this section by exploring the role of mean parameters in inference problems. The majority of theoretical results presented in this section are taken from [47].


3.1.1 Basics of exponential families

In Section 2.2 we defined Markov networks in terms of products of potential functions (equations (2.11) to (2.13) and (2.15)). In this section we are going to see that, in an exponential family setting, these products become additive decompositions.

Let us consider a pairwise Markov network defined over a graph G = (V, E) and associated with a set of random variables X = (X_1, …, X_n) where n = |V|. Without restricting the generality, let us assume that each random variable X_i, which is associated with node i ∈ V, is Bernoulli, i.e., it takes the “spin” values/states from I = {0, 1}.

Let Φ_G = (ϕ_j)_{1≤j≤d} be a collection of d potential functions such that ϕ_j : I^n → ℝ for all j ∈ {1, 2, …, d}. Here d is the number of cliques that cover the edges and nodes of G; the cliques are in a one–to–one correspondence with the potential functions ϕ_j.

Given the vector of potential functions ΦG, we associate to it a vector of canonical or exponential parameters: 퐖 = (푊푗)1≤푗≤푑. If the vector of potential functions ΦG is fixed, then each parameter vector W indexes a particular probability distribution 퐏W of the family.

For each fixed X ∈ I^n, we use ⟨W, Φ_G⟩ to denote the Euclidean inner product in ℝ^d between the vectors W and Φ_G:

\langle \mathbf{W}, \boldsymbol{\Phi}_G \rangle = \sum_{j=1}^{d} W_j \cdot \phi_j    (3.1)

With these notations, the exponential family associated with the set of potential functions Φ_G and the set of canonical parameters W consists of the following parameterized collection of probability density functions:

\mathbf{P}_W(\mathbf{X}) = \exp\big(\langle \mathbf{W}, \boldsymbol{\Phi}_G \rangle - A(\mathbf{W})\big)    (3.2)

where:

A(\mathbf{W}) = \ln \sum_{\mathbf{X} \in I^n} \exp\big(\langle \mathbf{W}, \boldsymbol{\Phi}_G \rangle\big)    (3.3)

We are particularly interested in canonical parameters W that belong to the set:

\boldsymbol{\Omega} = \{\mathbf{W} \in \mathbb{R}^d : A(\mathbf{W}) < +\infty\}    (3.4)

The following notions are important in subsequent development:


 Regular families: An exponential family for which the domain Ω is an open set is known as a regular family.

 Minimal: It is typical to define an exponential family with a vector of potential functions Φ_G such that there is no nonzero vector W ∈ ℝ^d for which ⟨W, Φ_G⟩ is equal to a constant. This condition gives rise to a so–called minimal representation, in which there is a unique canonical parameter vector W associated with each probability distribution P.

 Overcomplete: Instead of a minimal representation, it can be convenient to use a non–minimal or overcomplete representation, in which there exist linear combinations ⟨W, Φ_G⟩ that are equal to a constant. In this case, there actually exists an entire affine subset of parameter vectors W, each associated with the same distribution.

3.1.2 Canonical representation of pairwise Markov networks

The potential functions of a pairwise Markov network, as described by the equation (2.15), are either node potentials or edge potentials. Therefore we can differentiate between the node–specific canonical parameters Θ = (θ_i)_{i∈V} and the edge–specific canonical parameters Ŵ = (ŵ_ij)_{{i,j}∈E}. This leads us to a new representation of the pairwise Markov network’s canonical parameters W = (Ŵ, Θ). This new representation has dimension d = |V| + |E| and will be of particular importance in Chapter 4 and in Chapter 5. Hence, the exponential form of a pairwise Markov network and the corresponding cumulant function are:

\mathbf{P}_W(\mathbf{X}) = \mathbf{P}_W(X_1, \ldots, X_n) = \exp\!\left(\sum_{i \in V} \theta_i X_i + \sum_{\{i,j\} \in E} \hat{w}_{ij} X_i X_j - A(\mathbf{W})\right)    (3.5)

where:

A(\mathbf{W}) = \ln\!\left(\sum_{\mathbf{X} \in I^n} \exp\!\left(\sum_{i \in V} \theta_i X_i + \sum_{\{i,j\} \in E} \hat{w}_{ij} X_i X_j\right)\right)    (3.6)

The exponential form of a pairwise Markov network given by the equations (3.5) and (3.6) is a regular minimal representation. The representation is regular because the sum over X ∈ I^n in the equation (3.6) is finite for all choices of W ∈ ℝ^d, so the domain Ω is the full space ℝ^d. The representation is minimal because there is no nontrivial inner product ⟨W, Φ_G⟩ equal to a constant.
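For a small graph, the cumulant function (3.6) and the distribution (3.5) can be evaluated by brute-force enumeration of all 2^n states. The following Python sketch is illustrative (the names cumulant_function, prob, edges, theta, and w_hat are assumptions) and is feasible only for small n:

import itertools
import numpy as np

def cumulant_function(theta, w_hat, edges):
    """A(W) = ln sum_X exp( sum_i theta_i X_i + sum_{{i,j} in E} w_ij X_i X_j ), eq. (3.6)."""
    n = len(theta)
    scores = []
    for x in itertools.product([0, 1], repeat=n):
        x = np.array(x)
        scores.append(theta @ x + sum(w_hat[(i, j)] * x[i] * x[j] for (i, j) in edges))
    return float(np.log(np.sum(np.exp(scores))))

def prob(x, theta, w_hat, edges, A):
    """P_W(x) = exp( sum_i theta_i x_i + sum_{{i,j} in E} w_ij x_i x_j - A(W) ), eq. (3.5)."""
    x = np.array(x)
    score = theta @ x + sum(w_hat[(i, j)] * x[i] * x[j] for (i, j) in edges)
    return float(np.exp(score - A))

# Example: a 3-node chain 0-1-2 with "spin" states in I = {0, 1}.
theta = np.array([0.5, -0.2, 0.1])
edges = [(0, 1), (1, 2)]
w_hat = {(0, 1): 1.0, (1, 2): -0.5}
A = cumulant_function(theta, w_hat, edges)
p = prob((1, 0, 1), theta, w_hat, edges, A)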


An alternative canonical representation of pairwise Markov networks, named the standard overcomplete representation, uses the indicator functions 핀풊;풔 and 핀풊풋;풔풕 as potential functions.

Each pairing of a node 푖 ∈ 푉 and a state 푠 ∈ I yields a node–specific indicator function 핀풊;풔 with an associated vector of canonical parameters Θi = (휃푖;푠)푠∈I.

\mathbb{I}_{i;s}(X_i) = \begin{cases} 1, & \text{if } X_i = s \\ 0, & \text{otherwise} \end{cases} \quad \text{for all } i \in V, \ s \in I    (3.7)

Each pairing of an edge {i, j} ∈ E with a pair of states (s, t) ∈ I × I yields an edge–specific indicator function 𝕀_{ij;st}, as well as the associated canonical parameter ŵ_{ij;st} ∈ ℝ.

\mathbb{I}_{ij;st}(X_i, X_j) = \begin{cases} 1, & \text{if } X_i = s \text{ and } X_j = t \\ 0, & \text{otherwise} \end{cases} \quad \text{for all } \{i,j\} \in E, \ (s,t) \in I \times I    (3.8)

The indicator functions given by the equations (3.7) and (3.8) together with their associated canonical parameters define an exponential family with dimension 푑 = 2 ∙ |푉 | + 4 ∙ |퐸|. Hence, the exponential form of a pairwise Markov network with indicator functions given by the equations (3.7) and (3.8) is:

\mathbf{P}_W(\mathbf{X}) = \exp\!\left(\sum_{i \in V,\, s \in I} \mathbb{I}_{i;s}(X_i)\,\theta_{i;s} + \sum_{\{i,j\} \in E,\, (s,t) \in I \times I} \mathbb{I}_{ij;st}(X_i, X_j)\,\hat{w}_{ij;st} - A(\mathbf{W})\right)    (3.9)

where:

A(\mathbf{W}) = \ln\!\left(\sum_{\mathbf{X} \in I^n} \exp\!\left(\sum_{i \in V,\, s \in I} \mathbb{I}_{i;s}(X_i)\,\theta_{i;s} + \sum_{\{i,j\} \in E,\, (s,t) \in I \times I} \mathbb{I}_{ij;st}(X_i, X_j)\,\hat{w}_{ij;st}\right)\right)    (3.10)

The exponential form of a pairwise Markov network given by the equations (3.9) and (3.10) is regular and overcomplete. Like the previous representation, the cumulant function 퐴 is everywhere finite, so that the family is regular. In contrast to the previous representation, this representation is overcomplete because the indicator functions satisfy various linear relations, like for instance: ∑푠∈I 핀풊;풔(푋푖) = 1 for all 푋푖 ∈ I.

3.1.3 Mean parameterization of pairwise Markov networks

So far, we have characterized a pairwise Markov network by its vector of canonical parameters W ∈ Ω. It turns out that any exponential family, particularly a pairwise Markov network, has an alternative parameterization in terms of a vector of mean parameters.


Let P be a given probability distribution that is a member of an exponential family and whose collection of potential functions is Φ_G = (ϕ_j)_{1≤j≤d}. Here all the potential functions are indexed by j, not only the node–related ones: ϕ_j = ϕ_j(X) = ϕ_j(X_1, …, X_n). Then the mean parameter μ_j associated with the potential function ϕ_j, where j ∈ {1, 2, …, d}, is defined by the expectation:

\mu_j \overset{\text{def}}{=} \mathbf{E}_{\mathbf{P}}[\phi_j(\mathbf{X})] = \sum_{\mathbf{X} \in I^n} \phi_j(\mathbf{X}) \cdot \mathbf{P}(\mathbf{X})    (3.11)

Thus, given an arbitrary probability distribution P from an exponential family, we defined a vector μ ≝ (μ_1, …, μ_d) of d mean parameters such that there is one mean parameter μ_j for each potential function ϕ_j. We also define the set ℳ that contains all realizable mean parameters, i.e., all possible mean vectors μ that can be traced out as the underlying distribution P is varied:

\mathcal{M} = \big\{\boldsymbol{\mu} \in \mathbb{R}^d : \exists\, \mathbf{P} \text{ such that } \mathbf{E}_{\mathbf{P}}[\phi_j(\mathbf{X})] = \mu_j \text{ for all } j \in \{1, 2, \ldots, d\}\big\}    (3.12)

If the exponential family is a pairwise Markov network with indicator functions given by the equations (3.7) and (3.8), then the collection of potential functions ΦG takes the form:

횽퐆 = {핀풊;풔(푋푖) ∶ 푖 ∈ 푉, 푠 ∈ I} ∪ {핀풊풋;풔풕(푋푖, 푋푗) ∶ {푖, 푗} ∈ 퐸, (푠, 푡) ∈ I × I } (3.13)

The corresponding mean parameter vector μ ∈ ℝ^d, where d = 2·|V| + 4·|E|, consists of marginal probabilities over singleton variables and marginal probabilities over pairs of variables that correspond to graph edges:

\boldsymbol{\mu} = \{\mu_{i;s} : i \in V, s \in I\} \cup \{\mu_{ij;st} : \{i,j\} \in E, (s,t) \in I \times I\}    (3.14)

where:

\mu_{i;s} = \mathbf{E}_{\mathbf{P}}[\mathbb{I}_{i;s}(X_i)] = \mathbf{P}[X_i = s] \quad \text{for all } i \in V, \ s \in I    (3.15)

and:

\mu_{ij;st} = \mathbf{E}_{\mathbf{P}}[\mathbb{I}_{ij;st}(X_i, X_j)] = \mathbf{P}[X_i = s, X_j = t] \quad \text{for all } \{i,j\} \in E, \ (s,t) \in I \times I    (3.16)

The corresponding set ℳ is known as the marginal polytope associated with the graph G and is denoted ℳ(G). Explicitly, ℳ(G) is given by:

\mathcal{M}(G) = \big\{\boldsymbol{\mu} \in \mathbb{R}^d : \exists\, \mathbf{P} \text{ such that eq. (3.15) holds } \forall i \in V, \forall s \in I \text{ and eq. (3.16) holds } \forall \{i,j\} \in E, \forall (s,t) \in I \times I\big\}    (3.17)
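Given any distribution P over I^n represented explicitly as a table (feasible only for small n), the mean parameters (3.15)–(3.16) are just its node and edge marginals. The sketch below is illustrative; the dictionary-based representation of P and the name mean_parameters are assumptions of the example.

def mean_parameters(P, n, edges):
    """Node marginals mu_{i;s} and edge marginals mu_{ij;st}, equations (3.15)-(3.16).

    P: dict mapping each configuration x (a length-n tuple of 0/1 values) to P(x).
    Returns one point of the marginal polytope M(G).
    """
    mu_node = {(i, s): 0.0 for i in range(n) for s in (0, 1)}
    mu_edge = {(i, j, s, t): 0.0 for (i, j) in edges for s in (0, 1) for t in (0, 1)}
    for x, p in P.items():
        for i in range(n):
            mu_node[(i, x[i])] += p                # mu_{i;s} = P[X_i = s]
        for (i, j) in edges:
            mu_edge[(i, j, x[i], x[j])] += p       # mu_{ij;st} = P[X_i = s, X_j = t]
    return mu_node, mu_edge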


3.1.4 The role of transformations between parameterizations

Various statistical computations, among them marginalization and maximum likelihood estimation, can be understood as transforming from one parameterization to the other.

The computation of the forward mapping, from canonical parameters 퐖 ∈ 훀 to mean parameters 훍 ∈ 퓜, can be viewed as a fundamental class of inference problems in exponential family models and is extremely difficult for many high–dimensional exponential families.

The computation of the backward mapping, namely from mean parameters μ ∈ ℳ to canonical parameters W ∈ Ω, also has a natural statistical interpretation. In particular, suppose that we are given a set of m samples 핏 of a multivariate random variable X = (X_1, …, X_n):

\mathbb{X} = \big(\mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(m)}\big)^{T} \quad \text{where } \mathbf{X}^{(j)} = \big(X_1^{(j)}, \ldots, X_n^{(j)}\big) \ \text{for } 1 \le j \le m    (3.18)

The samples are drawn independently from an exponential family member P_W(X) where the parameter W is unknown. If the goal is to estimate W, the classical principle of maximum likelihood dictates obtaining an estimate W̄ by maximizing the likelihood of the data, or equivalently (after taking logarithms and rescaling), maximizing the quantity:

\mathcal{L}(\mathbf{W}, \mathbb{X}) = \frac{1}{m} \sum_{j=1}^{m} \ln\big(\mathbf{P}_W(\mathbf{X}^{(j)})\big) = \langle \mathbf{W}, \bar{\boldsymbol{\mu}} \rangle - A(\mathbf{W})    (3.19)

where:

\bar{\boldsymbol{\mu}} = \bar{\mathbf{E}}[\boldsymbol{\Phi}_G(\mathbf{X})] = \frac{1}{m} \sum_{j=1}^{m} \boldsymbol{\Phi}_G(\mathbf{X}^{(j)})    (3.20)

is the vector of empirical mean parameters defined by the data 핏. The maximum likelihood estimate W̄ is chosen to achieve the maximum of this objective function. Generally, computing W̄ is another challenging problem, since the objective function involves the cumulant function A. Under suitable conditions, the maximum likelihood estimate is unique, and specified by the stationarity condition:

\mathbf{E}_{\bar{\mathbf{W}}}[\boldsymbol{\Phi}_G(\mathbf{X})] = \bar{\boldsymbol{\mu}}    (3.21)

Finding the unique solution to this equation is equivalent to computing the backward mapping μ → W and generally is computationally intensive.
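One standard way to approach the stationarity condition (3.21) numerically is gradient ascent on the rescaled log-likelihood (3.19), whose gradient with respect to W is μ̄ − E_W[Φ_G(X)]. The sketch below is illustrative only: it uses the minimal pairwise representation (3.5), computes the forward mapping by brute-force enumeration (so it only works for small n), and the names model_moments and fit_by_moment_matching are assumptions.

import itertools
import numpy as np

def model_moments(theta, w_hat, edges):
    """Forward mapping W -> mu by enumeration: E_W[X_i] and E_W[X_i * X_j] for the form (3.5)."""
    n = len(theta)
    configs = list(itertools.product([0, 1], repeat=n))
    scores = np.array([theta @ np.array(x)
                       + sum(w_hat[e] * x[e[0]] * x[e[1]] for e in edges) for x in configs])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                              # P_W(x), equation (3.5)
    node = np.array([sum(p * x[i] for p, x in zip(probs, configs)) for i in range(n)])
    edge = {e: sum(p * x[e[0]] * x[e[1]] for p, x in zip(probs, configs)) for e in edges}
    return node, edge

def fit_by_moment_matching(samples, edges, steps=200, lr=0.5):
    """Gradient ascent on (3.19): at the optimum the model moments match mu_bar, eq. (3.21)."""
    m, n = samples.shape
    emp_node = samples.mean(axis=0)                                   # empirical E[X_i]
    emp_edge = {e: float(np.mean(samples[:, e[0]] * samples[:, e[1]])) for e in edges}
    theta, w_hat = np.zeros(n), {e: 0.0 for e in edges}
    for _ in range(steps):
        node, edge = model_moments(theta, w_hat, edges)
        theta = theta + lr * (emp_node - node)                        # gradient step on node parameters
        for e in edges:
            w_hat[e] += lr * (emp_edge[e] - edge[e])                  # gradient step on edge parameters
    return theta, w_hat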


3.2 The energy functional

In this section we introduce the concept of energy functional as a variational method for Markov random fields. In physics, the energy functional is the total energy of a certain system, as a functional of the system's state. In the context of the Boltzmann machine, the energy functional acts as an alternative to the Boltzmann–Gibbs distribution in the sense that it is more advantageous to maximize the energy functional than to compute the partition function of the Boltzmann–Gibbs distribution. The majority of theoretical results presented in this section are taken from [27,47,52].

Let us consider that we are given some complicated probabilistic system which is modelled by a Markov random field with n nodes and random variables X_1, X_2, …, X_n. We introduce a new function P̃ that represents the unnormalized measure of the probability distribution P that describes the Markov random field. We rewrite the equation (2.11) in a way that highlights the unnormalized measure of P.

\mathbf{P}(X_1, X_2, \ldots, X_n) = \frac{1}{Z} \prod_{\mathbf{X}_c \in C_G} \phi_c(\mathbf{X}_c) = \frac{\tilde{\mathbf{P}}(X_1, X_2, \ldots, X_n)}{Z}    (3.22)

where:

\tilde{\mathbf{P}}(X_1, X_2, \ldots, X_n) = \prod_{\mathbf{X}_c \in C_G} \phi_c(\mathbf{X}_c)    (3.23)

and:

Z = \sum_{X_1, X_2, \ldots, X_n} \prod_{\mathbf{X}_c \in C_G} \phi_c(\mathbf{X}_c) = \sum_{X_1, X_2, \ldots, X_n} \tilde{\mathbf{P}}(X_1, X_2, \ldots, X_n)    (3.24)

Our goal is to construct an approximation of the joint distribution P; we name this new distribution Q. In order to reach this goal, we employ the following strategy: instead of looking for a distribution equivalent to P, we look for a distribution reasonably close to P. Moreover, we want to make sure that we can perform inference efficiently in the given Markov random field by using Q. Therefore, instead of choosing a priori a single distribution Q, we first choose a family of approximating distributions ℚ = {Q_i : 1 ≤ i ≤ n}, then we let the optimization machinery choose a particular member from this family.

In our journey to finding a decent approximation for 퐏 we are going to use the Kullback–Leibler divergence or KL–divergence which is defined in Appendix B (equation (B29)). If we explicitly write the expectation with respect to 퐐 in the definition of KL(퐐||퐏), then we obtain the following equation:


\mathrm{KL}(\mathbf{Q} \| \mathbf{P}) = \mathrm{KL}\big(\mathbf{Q}(X_1 \ldots X_n) \,\|\, \mathbf{P}(X_1 \ldots X_n)\big) = \sum_{X_1, \ldots, X_n} \mathbf{Q}(X_1 \ldots X_n) \cdot \ln\!\left(\frac{\mathbf{Q}(X_1, \ldots, X_n)}{\mathbf{P}(X_1, \ldots, X_n)}\right)    (3.25)

where: P is the “true” joint distribution from which the data was generated; Q is a distribution from a certain family of distributions that, more or less, approximates P; and KL(Q(X_1, …, X_n)||P(X_1, …, X_n)) is the KL–divergence of Q and P.

We observe that the computation of KL–divergence from the equation (3.25) involves an intractable operation which is the explicit summation over all possible instantiations of 푋1, … , 푋푛.

However, since we know from the equations (3.22) and (3.23) what P(X_1, …, X_n) and P̃(X_1, X_2, …, X_n) look like, we can exploit this fact to rewrite the KL–divergence in a simpler form. Before we present this simplified form of the KL–divergence, we need to introduce a few concepts related to energy and entropy in Markov random fields.

The entropy of 푋1, … , 푋푛 with respect to 퐐 is given by the equation (B27) from Appendix B:

S_{\mathbf{Q}}(X_1, \ldots, X_n) = -\sum_{X_1, \ldots, X_n} \mathbf{Q}(X_1, \ldots, X_n) \cdot \ln\big(\mathbf{Q}(X_1, \ldots, X_n)\big)    (3.26)

Definition 3.1:

The energy functional 퐹[퐏̃, 퐐] of two probability distributions 퐏 and 퐐 is defined in connection with the unnormalized measure 퐏̃(푋1, 푋2, … , 푋푛) of 퐏 as:

퐹[퐏̃, 퐐] = 퐄퐐 [ln (퐏̃(푋1, … , 푋푛))] + 푆퐐(푋1, … , 푋푛) (3.27)

where 퐄퐐 [ln (퐏̃(푋1, … , 푋푛))] represents the expectation with respect to 퐐 of the logarithm of the unnormalized measure of 퐏 and 푆퐐(푋1, … , 푋푛) denotes the entropy of 푋1, … , 푋푛 with respect to 퐐.

Equivalent forms of the energy functional can be obtained by expanding the expectation in the first term and substituting the entropy in the second term of the equation (3.27):

F[\tilde{\mathbf{P}}, \mathbf{Q}] = \mathbf{E}_{\mathbf{Q}}\!\left[\ln\!\left(\prod_{\mathbf{X}_c \in C_G} \phi_c(\mathbf{X}_c)\right)\right] + S_{\mathbf{Q}}(X_1, \ldots, X_n)

F[\tilde{\mathbf{P}}, \mathbf{Q}] = \mathbf{E}_{\mathbf{Q}}\!\left[\sum_{\mathbf{X}_c \in C_G} \ln\big(\phi_c(\mathbf{X}_c)\big)\right] + S_{\mathbf{Q}}(X_1, \ldots, X_n)

F[\tilde{\mathbf{P}}, \mathbf{Q}] = \sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] + S_{\mathbf{Q}}(X_1, \ldots, X_n)    (3.28)

F[\tilde{\mathbf{P}}, \mathbf{Q}] = \sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] - \sum_{X_1, \ldots, X_n} \mathbf{Q}(X_1, \ldots, X_n) \cdot \ln\big(\mathbf{Q}(X_1, \ldots, X_n)\big)

The energy functional contains two terms:

 The first term, called the energy term, involves expectations with respect to Q of the logarithms of the factors ϕ_c. Each factor ϕ_c appears as a separate term. Thereby, if the factors ϕ_c are small – and this is the case for the Boltzmann machine – each expectation deals with relatively few variables. The difficulty of dealing with these expectations depends on the properties of the distribution Q. Assuming that inference is “easy” in Q, we should be able to evaluate such expectations relatively easily.
 The second term, called the entropy term, is the entropy of Q. The choice of Q determines whether this term is tractable, i.e., whether we can evaluate it efficiently.

The following theorem clarifies the relationship between the KL–divergence and the energy functional. The proof of this theorem is outside the scope of this paper. However, a proof can be found in [54].

Theorem 3.1:

The KL–divergence of the probability distributions 퐐 and 퐏 can be calculated using the formula:

\mathrm{KL}\big(\mathbf{Q}(X_1, \ldots, X_n) \,\|\, \mathbf{P}(X_1, \ldots, X_n)\big) = -F[\tilde{\mathbf{P}}, \mathbf{Q}] + \ln\big(Z(\mathbf{P})\big)    (3.29)

where F[P̃, Q] is the energy functional given by Definition 3.1 and Z(P) is the partition function of the probability distribution P.

Equivalently, the equation (3.29) can be written using free energies as in [4]:

KL(퐐(푋1, … , 푋푛)||퐏(푋1, … , 푋푛)) = 퐹[퐐] − 퐹[퐏] (3.30)


where 퐹[퐐] is a variational free energy and 퐹[퐏] is the true free energy of the Markov random field.

The variational free energy is equal to the negative of the energy functional F[P̃, Q] and, generally, is a Gibbs free energy (equation (2.21)). The exact (true) free energy is the Helmholtz free energy (equation (2.22)). Without restricting the generality, in this section we can assume that the constant pseudo–temperature at equilibrium is 1. Therefore, the variational free energy and the true free energy can be written as:

퐹[퐐] = −퐹[퐏̃, 퐐] (3.31)

퐹[퐏] = − ln(푍(퐏)) (3.32)

If we incorporate the equations (3.28) into the equation (3.29) we obtain the following equivalent forms of the KL–divergence of probability distributions 퐐 and 퐏:

\mathrm{KL}(\mathbf{Q} \| \mathbf{P}) = -\sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] - S_{\mathbf{Q}}(X_1, \ldots, X_n) + \ln\big(Z(\mathbf{P})\big)    (3.33)

\mathrm{KL}(\mathbf{Q} \| \mathbf{P}) = -\sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] + \sum_{X_1, \ldots, X_n} \mathbf{Q}(X_1, \ldots, X_n) \cdot \ln\big(\mathbf{Q}(X_1, \ldots, X_n)\big) + \ln\big(Z(\mathbf{P})\big)

We observe that the term ln(푍(퐏)) from the equations (3.29) and (3.33) doesn’t depend on 퐐. Hence, if we want to minimize KL(퐐||퐏) with respect to 퐐, we just need to minimize the first two terms of the right–hand side of the equations (3.33), which means we need to either maximize the energy functional 퐹[퐏̃, 퐐] (equations (3.28)) or minimize the variational free energy 퐹[퐐] (equation (3.31)).

To summarize, instead of searching for a good approximation 퐐 of the true probability 퐏, we need to solve one of the following equivalent optimization problems:

 maximize the energy functional 퐹[퐏̃, 퐐];  minimize the variational free energy 퐹[퐐];  minimize the KL–divergence KL(퐐||퐏).

Choosing one of these problems depends on the specifics of the problem we try to solve. Importantly, the energy functional and, equivalently, the variational free energy involve expectations with respect to Q. By choosing approximations Q that allow for efficient inference, we can both evaluate the energy functional (or the variational free energy) and optimize it effectively.


Moreover, the KL–divergence enjoys the property of being always non-negative and being zero if and only if Q and P are equal. The proof of this claim can be found in [54]:

KL(퐐||퐏) ≥ 0 (3.34) and: KL(퐐||퐏) = 0 if and only if 퐐 = 퐏 (3.35)

Then, from the equations (3.29) and (3.30) we can infer that:

-F[\mathbf{P}] \ge F[\tilde{\mathbf{P}}, \mathbf{Q}]    (3.36)

and:

F[\mathbf{P}] \le F[\mathbf{Q}]    (3.37)

The inequalities (3.36) and (3.37) together with the equation (3.32) are significant because they provide bounds involving the variational free energy and the energy functional:

 The variational free energy F[Q] is an upper bound for the true free energy F[P] for any choice of Q. This translates into the result of the optimization problem “minimize F[Q]” being an upper bound for F[P].
 The energy functional F[P̃, Q] is a lower bound for −F[P] = ln(Z(P)) for any choice of Q. This translates into the result of the optimization problem “maximize F[P̃, Q]” being a lower bound for ln(Z(P)).

Therefore, instead of directly computing the true partition function Z(P), we can look for a decent approximation Q of P. Moreover, depending on the type of optimization employed (minimization or maximization), we should obtain a decent upper or lower bound of −ln(Z(P)), which becomes a decent lower or upper bound of ln(Z(P)), which in turn leads to a decent lower–bound or upper–bound approximation of Z(P).
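Theorem 3.1 and the bounds above are easy to verify numerically on a tiny example. The sketch below is illustrative (the two-variable model and the factored trial distribution Q are arbitrary assumptions); it checks that KL(Q||P) = −F[P̃, Q] + ln Z(P) and that the energy functional never exceeds ln Z(P):

import numpy as np

# A tiny Markov random field over two binary variables with a single clique potential phi.
phi = {(x1, x2): np.exp(1.2 * x1 * x2 + 0.3 * x1 - 0.4 * x2)
       for x1 in (0, 1) for x2 in (0, 1)}                 # unnormalized measure P~, eq. (3.23)
Z = sum(phi.values())                                     # partition function, eq. (3.24)
P = {x: phi[x] / Z for x in phi}                          # true distribution, eq. (3.22)

# An arbitrary fully factored trial distribution Q(x1, x2) = Q1(x1) * Q2(x2).
q1, q2 = np.array([0.6, 0.4]), np.array([0.3, 0.7])
Q = {(x1, x2): q1[x1] * q2[x2] for x1 in (0, 1) for x2 in (0, 1)}

kl = sum(Q[x] * np.log(Q[x] / P[x]) for x in Q)                     # KL(Q || P), eq. (3.25)
energy_functional = (sum(Q[x] * np.log(phi[x]) for x in Q)          # E_Q[ln P~]
                     - sum(Q[x] * np.log(Q[x]) for x in Q))         # + S_Q, eq. (3.27)

assert np.isclose(kl, -energy_functional + np.log(Z))               # Theorem 3.1, eq. (3.29)
assert energy_functional <= np.log(Z) + 1e-12                       # F[P~, Q] <= ln Z(P), eq. (3.36)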

3.3 Gibbs free energy revisited

In Section 2.3 we introduced the concept of Gibbs free energy in Markov random fields by analogy with the homologue concept from thermodynamics. In this section we firstly introduce the concepts of Hamiltonian and Plefka expansion; then we present two approaches for defining a variational Gibbs free energy in a Markov random field. Before we start, we mention that we use the notation (A2) from Appendix A to designate by X a multivariate random variable that represents all the random variables X_1, …, X_n of a Markov random field and the notation (A3) from Appendix A to designate by X_{−i} all the random variables from X except X_i.

3.3.1 Hamiltonian and Plefka expansion

Hamiltonian mechanics is a theory developed as a reformulation of classical mechanics; it predicts the same outcomes as non–Hamiltonian (Newtonian) classical mechanics but uses a different mathematical formalism, providing a more abstract understanding of the theory. Hamiltonian mechanics contributed to the subsequent formulation of statistical mechanics and quantum mechanics. The Hamiltonian is an operator introduced by Hamiltonian mechanics that in most cases corresponds to the total energy of the system. For example, the Hamiltonian of a closed system is the sum of the kinetic energies of all the particles, plus the potential energy of the particles associated with the system.

The Plefka expansion is an approximate method to compute free energies in physical systems. The method, originally applied to classical spin systems, can be applied to any model for which a transition from a 'trivial' disordered phase to an ordered phase occurs as some initially small parameter is varied [55]. That parameter need not be the inverse temperature [55]. In theory, the Plefka expansion is a “high–temperature” expansion of the ordinary free energy of the system. Concretely, it is a Taylor expansion of the ordinary free energy with respect to the inverse temperature such that the resulting free energy is valid in both the high–temperature and low–temperature phases of the system [55].

The concepts of Hamiltonian and Plefka expansion can be extended to Markov random fields.

We consider a pairwise Markov random field with binary random variables X_1, …, X_n defined by the equations (2.11) to (2.13) and whose joint probability distribution P is a Boltzmann–Gibbs distribution described by the equation (2.5). The canonical parameters W = (Ŵ, Θ) of the Markov random field are given by the equation (3.5). The derivation by Plefka is particularly suitable for this type of Markov networks since it does not regard the parameters w_ij as random quantities and hence does not require averaging over them. Unlike spin glass theory, where the parameters w_ij are generally regarded as random variables representing random interactions and their properties are analyzed in the thermodynamic limit, i.e., their properties do not depend on a particular realization of w_ij [22], in Markov random field theory the parameters w_ij are given and fixed, and hence in principle they cannot be thought of as random variables [22].

In Plefka’s argument [56-57], we associate to a given Markov random field the Hamiltonian H(α), where α is the expansion parameter:

H(\alpha) = -\alpha \cdot \frac{1}{2} \sum_{\{i,j\} \in E} \hat{w}_{ij}\,X_i\,X_j - \sum_{i \in V} \theta_i\,X_i    (3.38)

The free energy F corresponding to the Hamiltonian H(α) is given by the following formula [56-57]:

-\beta \cdot F\big[\alpha, \beta, \{\mu_i\}_{i \in V}\big] = \ln\big(\mathrm{Tr}\, \exp(-\beta \cdot H(\alpha))\big) - \beta \cdot \sum_{i \in V} \theta_i\,\mu_i    (3.39)

where:

\beta = \frac{1}{\mathbf{k} \cdot \mathbf{T}}    (3.40)

and:

\mu_i = \mathbf{E}_{H(\alpha)}[X_i] \quad \text{for all } i \in V    (3.41)

where: T is the absolute temperature of the system; k is the Boltzmann constant; μ_i is the mean value of the random variable X_i with respect to the Boltzmann–Gibbs distribution defined by the Hamiltonian H(α); and Tr denotes the trace of a matrix. Generally, in Markov networks the Boltzmann constant k can be taken equal to 1.

The Plefka expansion of the free energy 퐹 given by the equation (3.39) is obtained by suppressing the dependence of 퐹 on 훽 and {휇푖}푖∈푉 and then expanding 퐹 into a power series of 훼 as follows [56-57]:

F[\alpha] = F[0] + \sum_{n=1}^{\infty} \frac{\alpha^n}{n!} \cdot \left.\frac{\partial^n F}{\partial \alpha^n}\right|_{\alpha=0} = F[0] + \alpha \cdot F'[0] + \frac{1}{2} \cdot \alpha^2 \cdot F''[0] + \cdots    (3.42)

where the derivatives with respect to α, F'[\alpha] = \frac{\partial F}{\partial \alpha}, F''[\alpha] = \frac{\partial^2 F}{\partial \alpha^2}, and so on, should be taken with μ_i fixed, for all i ∈ V.

The coefficients of the Plefka expansion up to the second order are the following [56-57]:

F[0] = \frac{1}{2} \sum_{i \in V} \left[(1 + \mu_i) \ln\!\left(\frac{1 + \mu_i}{2}\right) + (1 - \mu_i) \ln\!\left(\frac{1 - \mu_i}{2}\right)\right]    (3.43)

F'[0] = -\frac{1}{2} \sum_{\{i,j\} \in E} \hat{w}_{ij}\,\mu_i\,\mu_j    (3.44)

F''[0] = -\frac{1}{2} \sum_{\{i,j\} \in E} \hat{w}_{ij}^2\,(1 - \mu_i^2)(1 - \mu_j^2)    (3.45)

Together with the equations (3.43) to (3.45) the equation (3.42) becomes:

\beta \cdot F[\alpha] = \frac{1}{2} \sum_{i \in V} \left[(1 + \mu_i) \ln\!\left(\frac{1 + \mu_i}{2}\right) + (1 - \mu_i) \ln\!\left(\frac{1 - \mu_i}{2}\right)\right] - \frac{\beta \alpha}{2} \sum_{\{i,j\} \in E} \hat{w}_{ij}\,\mu_i\,\mu_j - \left(\frac{\beta \alpha}{2}\right)^{\!2} \sum_{\{i,j\} \in E} \hat{w}_{ij}^2\,(1 - \mu_i^2)(1 - \mu_j^2) + O(\alpha^3)    (3.46)

Since H(α = 1) is the original Hamiltonian to be considered, leaving the convergence problem aside and neglecting the higher–order terms O(α³), setting α = 1 in the equation (3.46) yields the true free energy of the Markov random field: F[α] ≡ F[P] [56-57]. The free energy obtained by truncating the Plefka expansion of the ordinary free energy is a Gibbs free energy as well.
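As an illustration, the truncated expansion (3.46) is straightforward to evaluate for given spin means. The Python sketch below is illustrative (the names w_hat and mu and the choice β = α = 1 are assumptions, and the means μ_i are assumed to lie strictly inside (−1, 1)):

import numpy as np

def truncated_plefka_free_energy(w_hat, mu, beta=1.0, alpha=1.0):
    """beta*F from equation (3.46), truncated after the second-order term.

    w_hat: dict mapping an edge (i, j) to its weight; mu: array of spin means in (-1, 1).
    """
    entropy_term = 0.5 * np.sum((1 + mu) * np.log((1 + mu) / 2)
                                + (1 - mu) * np.log((1 - mu) / 2))
    first_order = -(beta * alpha / 2) * sum(w * mu[i] * mu[j]
                                            for (i, j), w in w_hat.items())
    second_order = -((beta * alpha / 2) ** 2) * sum(w ** 2 * (1 - mu[i] ** 2) * (1 - mu[j] ** 2)
                                                    for (i, j), w in w_hat.items())
    return entropy_term + first_order + second_order

Truncating after the first-order term corresponds to the mean field free energy mentioned at the end of this section and studied in Section 3.4; the second-order term is the correction retained by the expansion above.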

3.3.2 The Gibbs free energy as a variational energy

We consider we are given the Markov random field described in Section 3.3.1 and we denote by X the set of its random variables. We are given a proper subset Y of X together with 풫, which denotes the set of marginal probabilities with respect to 퐏 of all the variables belonging to Y.

Our task is to define a Gibbs free energy for this Markov random field by performing a partial constrained minimization over a distribution 퐐 of certain form such that the marginals of 퐏 corresponding to the variables included in Y are kept in 퐐. The intended optimization is a partial constrained minimization, where the term partial signals the fact that only some of the random variables 푋1, … , 푋푛 are constrained.

Formally this optimization task is represented as:

Given:

\mathbf{X} = (X_1, \ldots, X_n) \supset \mathbf{Y} = \{X_{i_1}, X_{i_2}, \ldots, X_{i_m}\} = \{X_{i_j}\}_{1 \le j \le m}    (3.47)

such that:

\forall j \in \{1, \ldots, m\}, \ \exists k \in \{1, \ldots, n\} \ \text{such that} \ X_k \equiv X_{i_j}    (3.48)

and given:

\mathcal{P} = \{p_1, p_2, \ldots, p_m\} = \{p_j\}_{1 \le j \le m}    (3.49)

where:

p_j = \mathrm{MARG}(\mathbf{P}, X_{i_j}) = \sum_{\mathbf{X}_{-i_j}} \mathbf{P}(X_1, \ldots, X_k, \ldots, X_n) \quad \text{with } X_{i_j} \equiv X_k \ \text{cf. (3.48)}    (3.50)

Construct:

G\big[\{p_j\}_j\big] = \min_{\mathbf{Q}} \big\{F[\mathbf{Q}] : \mathrm{MARG}(\mathbf{Q}, X_{i_j}) = p_j, \ 1 \le j \le m\big\}    (3.51)

such that:

\mathbf{Q}_{E_{\mathbf{Q}}}(\mathbf{Y}) = \frac{1}{Z} \exp\big(-E(\mathbf{Y})\big) = \mathbf{P}(\mathbf{Y})    (3.52)

In order to accomplish this task, we could follow any of the following approaches:

 Approach 1:

 Step 1.1: We obtain a Gibbs free energy by truncating up to the nth–order term the Plefka expansion of the ordinary free energy (formulae (3.42) and (3.46)).
 Step 1.2: We obtain another Gibbs free energy, which generally is a variational Gibbs free energy, by minimizing the Gibbs free energy obtained in Step 1.1 over the parameters {p_j}_j.

 Approach 2:

 Step 2.1: We obtain a Gibbs free energy by using the formula (2.21).
 Step 2.2: We obtain another Gibbs free energy, which generally is a variational Gibbs free energy, by minimizing the Gibbs free energy obtained in Step 2.1 over the parameters {p_j}_j.

In this section we are going to exemplify Approach 1. Step 1.1 is explained in Section 3.3.1 so we are going to show only how to perform Step 1.2. In order to achieve this goal, we follow the work of Welling and Teh [4].

The natural way to enforce the constraints on the marginals is by employing a set of Lagrange multipliers {휆푗}푗 and incorporating them in the approximation of the free energy 퐹 obtained in Step 1.1:

F[\mathbf{Q}] \leftarrow F[\mathbf{Q}] - \sum_{j} \lambda_j \cdot \big(\mathrm{MARG}(\mathbf{Q}, X_{i_j}) - p_j\big)    (3.53)

We minimize 퐹[퐐] over 퐐 in terms of the Lagrange multipliers {휆푗}푗 and the parameters {푝푗}푗. The solution obtained is again a Boltzmann–Gibbs distribution, but with a modified energy which includes additional bias terms:


E\big(\{X_{i_j}\}_j\big) \rightarrow E\big(\{X_{i_j}\}_j\big) - \sum_{j} \lambda_j \cdot X_{i_j}    (3.54)

After inserting the expression (3.54) into the free energy given by (3.53) and minimizing over the Lagrange multipliers {λ_j}_j, we find the values of {λ_j}_j as a function of the parameters {p_j}_j. The resulting Gibbs free energy is:

G\big[\{p_j\}_j\big] = \min_{\{\lambda_j\}_j} \left\{\sum_{j} \lambda_j \cdot p_j - \ln Z\big(\{\lambda_j\}_j\big)\right\}    (3.55)

where Z({λ_j}_j) is the normalizing constant for the Boltzmann–Gibbs distribution with energy defined by the equation (3.54). The equation (3.55) is known as the Legendre transform between {λ_j}_j and {p_j}_j. By shifting the Lagrange multipliers as follows:

\lambda_j' = \lambda_j + \theta_j    (3.56)

we can pull the contribution of the thresholds to the Gibbs free energy out of the Legendre transform and obtain another form of the resulting Gibbs free energy:

G\big[\{p_j\}_j\big] = -\sum_{j} \theta_j \cdot p_j + \min_{\{\lambda_j'\}_j} \left\{\sum_{j} \lambda_j' \cdot p_j - \ln Z'\big(\{\lambda_j'\}_j\big)\right\}    (3.57)

where Z′ is the partition function of the modified Boltzmann–Gibbs distribution with all the thresholds {θ_j}_j set to zero:

Z'\big(\{\lambda_j'\}_j\big) = \sum_{\mathbf{X}} \exp\!\left(-\sum_{\{j,l\}} \hat{w}_{i_j i_l} \cdot X_{i_j} \cdot X_{i_l} - \sum_{j} \lambda_j' \cdot X_{i_j}\right)    (3.58)

Various variational Gibbs free energies can be obtained by following this approach. For instance, the mean field free energy is the Gibbs free energy obtained by truncating the Plefka expansion of the free energy (equation (3.46)) in the first order and minimizing it with respect to single node marginals.


3.4 Mean field approximation

The mean field approximation is a variational approximation of the true free energy, or, equivalently, of the energy functional, over a computationally tractable family ℚ of simple distributions:

ℚ = {퐐퐢 ∶ 1 ≤ 푖 ≤ 푛} (3.59)

The mean field approximation of the true free energy is called the mean field free energy and is a Gibbs free energy. In this section our goal is to obtain an expression for the mean field free energy by maximizing the energy functional. In Section 3.5 we will show how we can obtain an equivalent expression for the mean field free energy in relation with the Bethe free energy.

The fact that the distributions 퐐 are tractable comes with a cost: they are not generally sufficiently expressive to capture all the information of the true probability distribution 퐏. Before we present the simplest mean field algorithm, often called naïve mean field, we introduce the following notations:

 We use the notation (A2) from Appendix A to denote a multivariate random variable that represents either all the random variables of the Markov random field X = (X_1, X_2, …, X_n), or all the random variables belonging to a specific clique X_c whose potential ϕ_c would appear in the joint probability distribution P of the random field.
 We use the notation (A3) from Appendix A to designate by X_{−i} all the random variables from X except X_i.
 We use the notation E_{Y~Q} to designate the expectation with respect to Q over all the random variables of the given Markov random field “contained” in the multivariate random variable Y. A few examples of this notation are: E_{X_{−i}~Q} and E_{X_c~Q}.

3.4.1 The mean field energy functional

The naïve mean field algorithm looks for the distribution 퐐 closest to the true distribution 퐏 in terms of KL(퐐||퐏) inside the class of distributions representable as a product of independent marginals:

\mathbf{Q}(X_1, \ldots, X_n) = \prod_{1 \le i \le n} \mathbf{Q}(X_i)    (3.60)


A few observations should be made regarding the equation (3.60).

 On one hand, the approximation of P as a fully factored distribution assumes that all variables X_1, …, X_n of P are independent of each other in Q. Consequently, this approximation doesn’t capture any of the dependencies existing in P between the variables belonging to a clique X_c, for all X_c ∈ C_G, i.e., the dependencies reflected by the clique potentials ϕ_c(X_c) from the equation (3.23).
 On the other hand, this distribution is computationally attractive since we can evaluate any query on Q by a product over terms that involve only the variables in the scope of the query (i.e., the set of variables that appear in the query). Moreover, to represent Q we need only describe the marginal probabilities of each of the variables X_1, …, X_n.
 In the machine learning literature the marginal probabilities Q(X_i), where 1 ≤ i ≤ n, are usually called mean parameters and denoted μ_i [4,5,16-23,27,47].

Before we derive the mean field algorithm, we are going to formulate the energy functional in a slightly different way. We do this by incorporating the formula (3.60) into the formulae (3.28):

F[\tilde{\mathbf{P}}, \mathbf{Q}] = \sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] - \sum_{X_1, \ldots, X_n} \left(\prod_{1 \le i \le n} \mathbf{Q}(X_i)\right) \cdot \ln\!\left(\prod_{1 \le i \le n} \mathbf{Q}(X_i)\right)    (3.61)

In the equation (3.61) the first term of the energy functional is itself a sum of terms that have the form E_{X~Q}[ln(ϕ_c(X_c))]. In order to evaluate these terms, we can use the equation (3.60) to compute Q(ϕ_c(X_c)) as a product of marginals, allowing the evaluation of this term to be performed in time linear in the number of random variables of the clique X_c.

\mathbf{Q}\big(\phi_c(\mathbf{X}_c)\big) = \prod_{X_i \in \mathbf{X}_c} \mathbf{Q}(X_i) \quad \text{for all } \mathbf{X}_c \in C_G, \ \phi_c \in \Phi_G    (3.62)

Then the cost to evaluate Q(X_1, …, X_n) overall is linear in the description size of the factors ϕ_c of P. For now we cannot expect to do much better.

\mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] = \sum_{\mathbf{X}_c} \mathbf{Q}\big(\phi_c(\mathbf{X}_c)\big) \cdot \ln\big(\phi_c(\mathbf{X}_c)\big)

\mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] = \sum_{\mathbf{X}_c} \left(\prod_{X_i \in \mathbf{X}_c} \mathbf{Q}(X_i)\right) \cdot \ln\big(\phi_c(\mathbf{X}_c)\big)    (3.63)


The second term of the energy functional in the equation (3.61) is the entropy of 푋1, … , 푋푛 with respect to 퐐 and, for a fully factored distribution 퐐, is also decomposable as follows. The proof for this claim can be found in [54].

S_{\mathbf{Q}}(X_1, \ldots, X_n) = -\sum_{X_1, \ldots, X_n} \left(\prod_{1 \le i \le n} \mathbf{Q}(X_i)\right) \cdot \ln\!\left(\prod_{1 \le i \le n} \mathbf{Q}(X_i)\right)

S_{\mathbf{Q}}(X_1, \ldots, X_n) = -\sum_{X_1, \ldots, X_n} \left(\prod_{1 \le i \le n} \mathbf{Q}(X_i)\right) \cdot \sum_{1 \le i \le n} \ln\big(\mathbf{Q}(X_i)\big)

S_{\mathbf{Q}}(X_1, \ldots, X_n) = \sum_{1 \le i \le n} S_{\mathbf{Q}}(X_i)    (3.64)

We substitute the appropriate quantities given by the equations (3.63) and (3.64) into the equation (3.61). Finally, the energy functional and the corresponding variational free energy for a fully factored distribution Q are given by the following formula:

F[\tilde{\mathbf{P}}, \mathbf{Q}] = -F_{MF}[\mathbf{Q}] = \sum_{\mathbf{X}_c \in C_G} \sum_{\mathbf{X}_c} \left(\prod_{X_i \in \mathbf{X}_c} \mathbf{Q}(X_i)\right) \ln\big(\phi_c(\mathbf{X}_c)\big) + \sum_{1 \le i \le n} S_{\mathbf{Q}}(X_i)    (3.65)

where F_{MF}[Q] is the mean field free energy.

The formula (3.65) shows that the energy functional for a fully factored distribution can be written as a sum of expectations, each expectation being defined over a small set of variables; each such set corresponds to a clique potential ϕ_c(X_c) in P. The complexity of evaluating this form of the energy functional depends on the size of the factors ϕ_c(X_c) in P and not on the topology of the Markov network. Thus, the energy functional can be represented and manipulated effectively, even in Markov networks that would require exponential time for exact inference.

3.4.2 Maximizing the energy functional: fixed–point characterization

In Section 3.2 we showed that, instead of searching for a good approximation Q of the true probability P, we could use a variational approach to either maximize the energy functional or minimize either the corresponding variational free energy or the KL–divergence. Each of these approaches transforms the original problem – approximate inference in a Markov random field – into an optimization problem. An interesting aspect of these optimization problems is the fact that, instead of approximating the objective, they approximate the optimization space. This is done by starting with a class of distributions:

ℚ = {퐐퐢 = 퐐(푋푖) ∶ 1 ≤ 푖 ≤ 푛} (3.66) that generally doesn’t contain the true distribution 퐏. Then, the distribution of this class that complies with the type of optimization performed and with the imposed constraints represents an approximation of the true probability of the underlying Markov network.

A formal description of the optimization problem “maximization of energy functional” follows.

Problem Mean Field Approximation:

Given: ℚ = { Q_i = Q(X_i) : 1 ≤ i ≤ n }
Find: Q ∈ ℚ
By maximizing: F[P̃, Q]
Subject to: Q(X_1, …, X_n) = ∏_i Q(X_i)
            ∑_{x_i} Q(x_i) = 1 for all i

The following theorem and corollaries provide a set of fixed–point equations that characterize the stationary points of Mean Field Approximation. These theoretical results are taken from [54] and adapted to our notations and conventions. We provide the proofs only for Theorem 3.2 and Corollary 3.5. We also provide our interpretation of these theoretical results.

Theorem 3.2:

The marginal Q(X_i) is a local maximum of Mean Field Approximation given {Q(X_j)}_{j≠i} if and only if:

\mathbf{Q}(x_i) = \frac{1}{Z_i} \exp\!\left\{\sum_{\phi_c \in \Phi_G} \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}[\ln \phi_c \mid x_i]\right\}

equivalent to:

\mathbf{Q}(x_i) = \frac{1}{Z_i} \exp\!\left\{\sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big) \mid x_i\big]\right\}    (3.67)


where Z_i is a local normalizing constant and E_{X~Q}[ln ϕ_c | x_i] is the conditional expectation for a given value x_i of X_i:

\mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}[\ln \phi_c \mid x_i] = \sum_{\mathbf{X}_c} \mathbf{Q}\big(\phi_c(\mathbf{X}_c) \mid x_i\big) \cdot \ln\big(\phi_c(\mathbf{X}_c)\big)

Proof: The proof of this theorem relies on proving the fixed–point characterization of the individual marginal 퐐(푋푖) in terms of the other components 퐐(푋1),…, 퐐(푋푖−1), 퐐(푋푖+1),…, 퐐(푋푛) as specified in the equation (3.67).

We first consider the restriction of the objective 퐹[퐏̃, 퐐] to those terms that involve 퐐(푋푖):

F_i[\mathbf{Q}] = \sum_{\substack{\mathbf{X}_c \in C_G \\ X_i \in \mathbf{X}_c}} \mathbf{E}_{\mathbf{X}_c \sim \mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] + S_{\mathbf{Q}}(X_i)    (3.68)

To optimize Q(X_i), we define the Lagrangian that consists of all terms in F[P̃, Q] that involve Q(X_i):

L_i[\mathbf{Q}] = \sum_{\substack{\mathbf{X}_c \in C_G \\ X_i \in \mathbf{X}_c}} \mathbf{E}_{\mathbf{X}_c \sim \mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big)\big] + S_{\mathbf{Q}}(X_i) + \lambda \cdot \left(\sum_{x_i} \mathbf{Q}(x_i) - 1\right)

where λ is a Lagrange multiplier that corresponds to the constraint that Q(X_i) is a distribution.

We now take derivatives with respect to 퐐(푥푖). The following result, whose proof we do not provide, plays an important role in the remainder of the derivation.

Lemma 3.3:

If 퐐(X) = ∏푖 퐐(푋푖), then for any function 푓 with scope 풰:

\frac{\partial\, \mathbf{E}_{\mathcal{U} \sim \mathbf{Q}}[f(\mathcal{U})]}{\partial\, \mathbf{Q}(x_i)} = \mathbf{E}_{\mathcal{U} \sim \mathbf{Q}}[f(\mathcal{U}) \mid x_i]

Using Lemma 3.3 and standard derivatives of entropies, we see that:

\frac{\partial L_i}{\partial\, \mathbf{Q}(x_i)} = \sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big) \mid x_i\big] - \ln \mathbf{Q}(x_i) - 1 + \lambda

Setting the derivative to 0 and rearranging terms we get that:


\ln \mathbf{Q}(x_i) = \lambda - 1 + \sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}\big[\ln\big(\phi_c(\mathbf{X}_c)\big) \mid x_i\big]

We take exponents of both sides and renormalize; because 휆 is constant relative to 푥푖, it drops out in the renormalization, so that we obtain the formula (3.67).

Theorem 3.2 shows only that the solution of the equation (3.67) is a stationary point of the equation (3.68). To prove that it is a maximum, we note that the equation (3.68) is a sum of two terms: \sum_{\mathbf{X}_c \in C_G} \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}[\ln(\phi_c(\mathbf{X}_c))] is linear in Q(X_i), given all the other components Q(X_j), j ≠ i; S_Q(X_i) is a concave function in Q(X_i). As a whole, given the other components of Q, the function F_i is concave in Q(X_i) and, therefore, has a unique global optimum, which is easily verified to be the solution of the equation (3.67) rather than any of the extremal points [54].

The following two corollaries help characterize the stationary points of Mean Field Approximation.

Corollary 3.4:

The distribution 퐐 is a stationary point of Mean Field Approximation if and only if, for each 푋푖, the equation (3.67) holds.

Corollary 3.5:

In the mean field approximation, 퐐(푥푖) is locally optimal only if:

\mathbf{Q}(x_i) = \frac{1}{Z_i} \exp\!\big\{\mathbf{E}_{\mathbf{X}_{-i} \sim \mathbf{Q}}\big[\ln \mathbf{P}(x_i \mid \mathbf{X}_{-i})\big]\big\}    (3.69)

where Z_i is a normalizing constant.

Proof: We recall that \tilde{\mathbf{P}} = \prod_{\phi_c \in \Phi_G} \phi_c = \prod_{\mathbf{X}_c \in C_G} \phi_c(\mathbf{X}_c) is the unnormalized measure defined by Φ_G and C_G. Due to the linearity of expectation we have:

\sum_{\phi_c \in \Phi_G} \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}[\ln \phi_c \mid x_i] = \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}\big[\ln \tilde{\mathbf{P}}(X_i, \mathbf{X}_{-i}) \mid x_i\big]

Because 퐐 is a product of marginals, we can rewrite 퐐( X−i | 푥푖 ) as 퐐( X−i ) and get that:

\mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}\big[\ln \tilde{\mathbf{P}}(X_i, \mathbf{X}_{-i}) \mid x_i\big] = \mathbf{E}_{\mathbf{X}_{-i} \sim \mathbf{Q}}\big[\ln \tilde{\mathbf{P}}(x_i, \mathbf{X}_{-i})\big]


Using properties of conditional distributions, it follows that:

퐏̃(푥푖, X−푖 ) = 푍 ∙ 퐏(푥푖, X−푖 ) = 푍 ∙ 퐏( X−푖 ) ∙ 퐏(푥푖 | X−i)

We conclude that:

\sum_{\phi_c \in \Phi_G} \mathbf{E}_{\mathbf{X} \sim \mathbf{Q}}[\ln \phi_c \mid x_i] = \mathbf{E}_{\mathbf{X}_{-i} \sim \mathbf{Q}}\big[\ln \mathbf{P}(x_i \mid \mathbf{X}_{-i})\big] + \mathbf{E}_{\mathbf{X}_{-i} \sim \mathbf{Q}}\big[\ln\big(\mathbf{P}(\mathbf{X}_{-i}) \cdot Z\big)\big]

Plugging this equality into the update equation (3.67) we get that:

\mathbf{Q}(x_i) = \frac{1}{Z_i} \exp\!\big\{\mathbf{E}_{\mathbf{X}_{-i} \sim \mathbf{Q}}\big[\ln \mathbf{P}(x_i \mid \mathbf{X}_{-i})\big]\big\} \cdot \exp\!\big\{\mathbf{E}_{\mathbf{X}_{-i} \sim \mathbf{Q}}\big[\ln\big(\mathbf{P}(\mathbf{X}_{-i}) \cdot Z\big)\big]\big\}

The term ln(P(X_{−i})·Z) does not depend on the value of x_i. Moreover, when a marginal is multiplied by a constant factor, it does not change the joint distribution: as the distribution is renormalized at the end to sum to 1, the constant is absorbed into the normalizing constant. Therefore, the constant term can be simply ignored and the formula we obtain is exactly the formula (3.69).

Corollary 3.5 shows that the marginal of 푋푖 in 퐐, i.e., 퐐(푋푖), is the geometric average of the conditional probability of 푥푖 given all other variables in the domain. The average is based on the probability that 퐐 assigns to all possible assignments to the variables in the domain. In this sense, the mean field approximation requires that the marginal of 푋푖 be “consistent” with the marginals of other variables [54].

Comparatively, the marginal of 푋푖 in 퐏 can be represented as an arithmetic average:

$\mathbf{P}(x_i) = \sum_{x_{-i}}\mathbf{P}(x_{-i})\cdot\mathbf{P}(x_i\mid x_{-i}) = \mathbf{E}_{\mathbf{X}_{-i}\sim\mathbf{P}}[\mathbf{P}(x_i\mid\mathbf{X}_{-i})]$   (3.70)

In general, the geometric average tends to lead to marginals that are more sharply peaked than the original marginals in $\mathbf{P}$. More significantly, the expectation in the equation (3.69) is relative to the approximating distribution $\mathbf{Q}$, while the expectation in the equation (3.70) is relative to the true distribution $\mathbf{P}$. However, this should not be interpreted as meaning that the approximation of the marginals of $\mathbf{P}$ by $\mathbf{Q}$ is a good one [54].


3.4.3 Maximizing the energy functional: the naïve mean field algorithm

We start by observing that, if a clique with potential function 휙푐 and set of nodes Xc doesn’t contain the variable 푋푖, i.e., 푋푖 ∉ Xc, then:

$\mathbf{E}_{\mathbf{X}_c\sim\mathbf{Q}}[\ln(\phi_c(\mathbf{X}_c))\mid x_i] = \mathbf{E}_{\mathbf{X}_c\sim\mathbf{Q}}[\ln(\phi_c(\mathbf{X}_c))]$ when $X_i\notin\mathbf{X}_c$   (3.71)

Hence, the expectation terms of such factors are independent of the value of $X_i$. Consequently, we can absorb them into the normalization constant $Z_i$ and get the following simplification.

Corollary 3.6:

In the mean field approximation 퐐(푥푖) is locally optimal only if:

$\mathbf{Q}(x_i) = \frac{1}{Z_i}\cdot\exp\Big\{\sum_{X_i\in\mathbf{X}_c,\ \mathbf{X}_c\in C_G}\mathbf{E}_{\mathbf{X}_c-\{X_i\}\sim\mathbf{Q}}[\ln(\phi_c(\mathbf{X}_c,x_i))]\Big\}$   (3.72)

where $Z_i$ is the normalization constant.

The equation (3.72) shows that 퐐(푥푖) has to be consistent with the expectation of the potentials in which it appears. The characterization of Corollary 3.6 is very useful for converting the fixed–point equations (3.67) into an algorithm that maximizes 퐹[퐏̃, 퐐]. All the terms on the right–hand side of the equation (3.72) involve expectations of variables other than 푋푖 and do not depend on the choice of 퐐(푋푖). We can achieve equality simply by evaluating the exponential terms for each value 푥푖, normalizing the results to sum to 1, and then assigning them to 퐐(푋푖).

As a consequence, we reach the optimal value of 퐐(푋푖) in one step.

The last statement should be interpreted with some care. The resulting value for 퐐(푋푖) is its optimal value given the choice of all other marginals. Thus, this step optimizes the function

퐹[퐏̃, 퐐] relative only to one single coordinate in the space – the marginal of 퐐(푋푖). To optimize the function in its entirety, we need to optimize relative to all the coordinates. We can embed this step in an iterated coordinate ascent algorithm, which repeatedly maximizes a single marginal at a time, given fixed choices to all of the others. The result is Algorithm 3.1.


Algorithm 3.1: Naïve Mean Field Approximation

Given: CG , ΦG , 퐐ퟎ // the initial choice of 퐐

begin

Step 1: 퐐 ← 퐐ퟎ

Step 2: 푢푛푝푟표푐푒푠푠푒푑 ← X = (푋1, 푋2, … , 푋푛)

Step 3: while 푢푛푝푟표푐푒푠푠푒푑 ≠ ∅ do

Step 4: choose 푋푖 from 푢푛푝푟표푐푒푠푠푒푑

Step 5: 퐐퐨퐥퐝(푋푖) ← 퐐(푋푖)

Step 6: for 푥푖 ∈ val(푋푖) do

// we iterate over all possible values of the random variable 푋푖

Step 7: $\mathbf{Q}(x_i) = \exp\big(\sum_{X_i\in\mathbf{X}_c}\mathbf{E}_{\mathbf{X}_c-\{X_i\}\sim\mathbf{Q}}[\ln\phi_c(\mathbf{X}_c,x_i)]\big)$

end for // 푥푖

Step 8: normalize 퐐(푥푖) to sum to 1

Step 9: if 퐐퐨퐥퐝(푋푖) ≠ 퐐(푋푖) then

Step 10: $unprocessed \leftarrow unprocessed \cup \big(\bigcup_{X_i\in\mathbf{X}_c,\ \mathbf{X}_c\in C_G}\mathbf{X}_c\big)$

Step 11: 푢푛푝푟표푐푒푠푠푒푑 ← 푢푛푝푟표푐푒푠푠푒푑 − {푋푖}

end while

end

return 퐐

Importantly, a single optimization doesn’t usually suffice; a subsequent modification to another marginal 퐐(푋푗) may result in a different optimal parameterization for 퐐(푋푖). Therefore, the algorithm repeats these steps until convergence. A key property of the coordinate ascent procedure is that each step leads to an increase in the energy functional. Hence, any iteration of Algorithm 3.1 results in a better approximation of the true distribution 퐏.


Theorem 3.7:

Algorithm 3.1 is guaranteed to converge. Moreover, the distribution returned by the algorithm is a stationary point of 퐹[퐏̃, 퐐], subject to the constraint that 퐐(X) = ∏1≤푖≤푛 퐐(푋푖) is a distribution.

The proof of Theorem 3.7 is outside the scope of this paper; the interested reader can find it in [54]. The distribution returned by Algorithm 3.1 is a stationary point of the energy functional, so in principle it could be a local maximum, a local minimum, or a saddle point. However, it cannot be a local minimum or a saddle point, because such stationary points are not stable convergence points of the algorithm: a small perturbation of 퐐 followed by optimization leads to a better convergence point [54]. Because the algorithm is unlikely to land precisely on such an unstable point and get stuck there, in practice the convergence points of the algorithm are local maxima, though not necessarily global maxima [54].
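For concreteness, for a binary pairwise network with weights $\hat{w}_{ij}$ and thresholds $\theta_i$ (the conventions used for Boltzmann machines in Chapter 4), the coordinate update of Algorithm 3.1 reduces to the sigmoid update $q_i \leftarrow \mathrm{sigm}(\sum_j \hat{w}_{ij} q_j - \theta_i)$. The following Python sketch is our own illustration of Algorithm 3.1 under these assumptions; the function and variable names are ours, not taken from the cited references.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_mean_field(W, theta, max_sweeps=200, tol=1e-6):
    """Coordinate-ascent mean field for a binary pairwise network.

    W     : symmetric (n, n) weight matrix with zero diagonal (the w_ij)
    theta : (n,) vector of thresholds
    Returns q, the vector of approximate marginals Q(X_i = 1).
    """
    n = len(theta)
    q = np.full(n, 0.5)                  # the initial choice of Q
    for _ in range(max_sweeps):
        max_change = 0.0
        for i in range(n):               # optimize one marginal at a time
            new_qi = sigmoid(W[i] @ q - theta[i])
            max_change = max(max_change, abs(new_qi - q[i]))
            q[i] = new_qi
        if max_change < tol:             # marginals stopped changing: convergence
            break
    return q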

3.5 Bethe approximation

The Bethe approximation is an approximation of the free energy, similar to the mean field approximation. The Bethe approximation reduces the problem of computing the partition function in a Markov random field to that of solving a set of non-linear equations – the Bethe fixed–point equations [58]. In this section we start by introducing the Bethe free energy and its “close relative” the Bethe–Gibbs free energy. Then, we describe briefly the belief propagation (BP) algorithm and we present the theoretical result due to Yedidia et al. [28,35] that establishes the connection between BP fixed–points and the Bethe free energy. Because this result is considered fundamental for approximate inference research, we include its original proof. We end this section by introducing a new approximate inference algorithm, due to Welling and Teh [4], which is based on the Bethe free energy and called belief optimization (BO).

We assume that we are given a pairwise Markov network with binary variables described by the equations (2.11) to (2.13) and (2.15). We rewrite the equation (2.11) by taking into consideration the fact that the potential functions, given by the equation (2.15), are either node potentials or edge potentials:


$\mathbf{P}(\mathbf{X}) = \frac{1}{Z}\cdot\prod_{\mathbf{X}_c\in C_G}\phi_c(\mathbf{X}_c) = \frac{1}{Z}\cdot\prod_{i}\phi_i(X_i)\cdot\prod_{i,j}\phi_{ij}(X_i,X_j)$   (3.73)

where: $\phi_i(X_i)$ is the local "evidence" for node $i$; $\phi_{ij}(X_i,X_j)$ is the clique potential that corresponds to the edge that connects the nodes $i$ and $j$; and $Z$ is the partition function. Any fixed evidence node is subsumed into our definition of $\phi_i(X_i)$.

We denote by 푝푖 the marginal probabilities over singleton variables and by 푝푖푗 the pairwise marginal probabilities over pairs of variables that correspond to edges in the underlying graph:

$p_i = \mathrm{MARG}(\mathbf{P},X_i) = \sum_{\mathbf{X}_{-i}}\mathbf{P}(X_1,\dots,X_i,\dots,X_n)$

$p_{ij} = \mathrm{MARG}(\mathbf{P},X_iX_j) = \sum_{\mathbf{X}-\{X_i,X_j\}}\mathbf{P}(X_1,\dots,X_i,\dots,X_j,\dots,X_n)$   (3.74)

The majority of theoretical results presented in this section come from [4, 28-29, 35-36].

3.5.1 The Bethe free energy

The original formula for the Bethe free energy, proposed by Yedidia et al. in [28-29,35-36], relies on a minimal canonical representation for the Markov network. An alternative form of the Bethe free energy that relies on the mean parameterization of the Markov network was introduced by Wainwright et al. in [47].

The Bethe free energy is the Gibbs free energy obtained by truncating the Plefka expansion of the free energy (equation (3.46)) in the second order and minimizing it with respect to single node marginals and pairwise marginals. Unlike the mean field free energy, which depends only on approximate marginals at single nodes, the Bethe free energy depends on approximate marginals at single nodes as well as approximate marginals on edges.

To understand the relationship between the Bethe free energy and the mean field free energy, we define a “close relative” of both by imposing additional constraints on the Bethe free energy. The Gibbs free energy resulted from this additional optimization is called the Bethe–Gibbs free energy. To distinguish between the Bethe free energy and the Bethe–Gibbs free energy, we denote the former by 𝒢훽 and the latter by 퐺훽.


We assume that we work under a set of hypotheses similar to the ones described by (3.47) to (3.52), except that Y = X. The constraints are represented by the set of all 푝푖 and the set of all 푝푖푗 defined in (3.74). Formally, the Bethe free energy is defined as:

$\mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = \min_{\mathbf{Q}}\{F[\mathbf{Q}] : \mathrm{MARG}(\mathbf{Q},X_i) = p_i \text{ and } \mathrm{MARG}(\mathbf{Q},X_iX_j) = p_{ij}\}$   (3.75)

where $\mathrm{MARG}(\mathbf{Q},X_i)$ denotes the singleton marginal probability of $\mathbf{Q}$ with respect to $X_i$ and $\mathrm{MARG}(\mathbf{Q},X_iX_j)$ denotes the pairwise marginal probability of $\mathbf{Q}$ with respect to the variables $X_i$ and $X_j$, whose corresponding nodes are connected in the Markov network.

In order to compute the Bethe free energy of a binary pairwise Markov network, in this section we follow Approach 2 described in Section 3.3.2. As previously mentioned, we can assume that the constant pseudo–temperature at equilibrium is 1. Therefore:

𝒢훽[퐏(푋1, … , 푋푛)] = 푈훽 − 푆훽 (3.76)

In Section 2.3 we learned that the energy of a pairwise Markov network is a quadratic function of the states, so the internal energy given by the formula (2.18) can be computed exactly in terms of {푝푖} and {푝푖푗} [4]:

$U_\beta = \mathbf{E}_{\mathbf{P}}\big[E[\{p_i\},\{p_{ij}\}]\big] = E[\{p_i\},\{p_{ij}\}]$   (3.77)

where:

$E[\{p_i\},\{p_{ij}\}] = -\sum_{\{i,j\}} p_{ij}\cdot\hat{w}_{ij} + \sum_i p_i\cdot\theta_i$

This means that the computation of the Bethe free energy (3.76) requires an approximation only for the entropy term (equation (2.19)). The idea is that we want to correct the mean field approximation which overestimates the entropy due to its assumption that all nodes are independent. The natural next step is to take pairwise dependencies into account. But just adding all pairwise entropy contributions to the mean field approximation would clearly over– count the entropy contributions at the nodes. Correcting for this over–counting gives the following approximation to the entropy [4]:

$S_\beta[\{p_i\},\{p_{ij}\}] = \sum_i S_i + \sum_{\{i,j\}}(S_{ij} - S_i - S_j) = \sum_i S_i\cdot(1-d_i) + \sum_{\{i,j\}} S_{ij}$   (3.78)

where: $d_i$ is the degree of node $i$, i.e., the number of neighbors of node $i$; $S_i$ is the mean field entropy for node $i$; and $S_{ij}$ is the pairwise entropy. The mean field entropy $S_i$ and the pairwise entropy $S_{ij}$ can be written as:


푆푖 = −(푝푖 ∙ ln 푝푖 + (1 − 푝푖) ∙ ln(1 − 푝푖)) (3.79)

$S_{ij} = -\big(p_{ij}\ln p_{ij} + (p_{ij}+1-p_i-p_j)\ln(p_{ij}+1-p_i-p_j) + (p_i-p_{ij})\ln(p_i-p_{ij}) + (p_j-p_{ij})\ln(p_j-p_{ij})\big)$   (3.80)

The Bethe free energy is obtained by substituting (3.77) and (3.78)–(3.80) into (3.76):

𝒢훽[{푝푖}, {푝푖푗}] = 퐸[{푝푖}, {푝푖푗}] − 푆훽[{푝푖}, {푝푖푗}] (3.81)

$\mathcal{G}_\beta = -\sum_{\{i,j\}} p_{ij}\cdot\hat{w}_{ij} + \sum_i p_i\cdot\theta_i + \sum_i\big(p_i\ln p_i + (1-p_i)\ln(1-p_i)\big)\cdot(1-d_i)$
$\qquad + \sum_{\{i,j\}}\big(p_{ij}\ln p_{ij} + (p_{ij}+1-p_i-p_j)\ln(p_{ij}+1-p_i-p_j) + (p_i-p_{ij})\ln(p_i-p_{ij}) + (p_j-p_{ij})\ln(p_j-p_{ij})\big)$   (3.82)
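To make the expression (3.82) concrete, the following Python sketch evaluates $\mathcal{G}_\beta$ from given singleton marginals, pairwise marginals, weights, and thresholds. It is our own illustration; the data layout and names are assumptions made only for this example.

import numpy as np

def bethe_free_energy(p1, p2, W, theta, edges):
    """Evaluate the Bethe free energy (3.82) for a binary pairwise network.

    p1    : (n,) singleton marginals p_i
    p2    : dict {(i, j): p_ij} pairwise marginals, keys with i < j
    W     : dict {(i, j): w_ij} edge weights, keys with i < j
    theta : (n,) thresholds
    edges : list of edges (i, j) with i < j
    """
    def xlogx(x):
        return x * np.log(x) if x > 0 else 0.0      # convention 0 ln 0 = 0
    n = len(p1)
    d = np.zeros(n)                                 # node degrees d_i
    for (i, j) in edges:
        d[i] += 1
        d[j] += 1
    G = float(np.dot(p1, theta))                    # energy term  + sum_i p_i * theta_i
    G -= sum(p2[(i, j)] * W[(i, j)] for (i, j) in edges)   # energy term - sum_ij p_ij * w_ij
    G += sum((1 - d[i]) * (xlogx(p1[i]) + xlogx(1 - p1[i])) for i in range(n))
    for (i, j) in edges:                            # pairwise (negative) entropy contributions
        q = p2[(i, j)]
        G += (xlogx(q) + xlogx(q + 1 - p1[i] - p1[j])
              + xlogx(p1[i] - q) + xlogx(p1[j] - q))
    return G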

The expression (3.78) for the entropy is exact when the underlying graph is a tree. Since the expression (3.77) for the energy is exact for general Boltzmann machines, it is also exact on Boltzmann trees. Consequently, the Bethe free energy (3.82) is exact on trees [4]. If the underlying graph has loops, then the distribution corresponding to the energy given by (3.82) is not always a properly normalized probability distribution [4]. Therefore, the Bethe free energy is not necessarily an upper bound for the true free energy 퐹 [4]; in other words, it does not fall into the category of variational free energies characterized by (3.36) and (3.37). So, when can we expect the Bethe free energy to be a good approximation for the free energy of the system? The above argument suggests that this should be the case when the graph is “close to a tree”, i.e., if there are not many short loops in the graph. If the underlying graph has tight loops, evidence impinging on one node can travel around these loops and return to the original node, causing it to be over–counted [4].

3.5.2 The Bethe–Gibbs free energy

In order to improve the approximation of the free energy, the Bethe free energy has been studied in connection with a well–known free energy: the mean field free energy. Welling and Teh proved that the mean field free energy is a small weight expansion of the Bethe free energy [4], which suggests that the Bethe free energy should be accurate for small weights and should improve on the mean field energy [4]. In this section we use a different approach to explore the relationship between the Bethe free energy and the mean field free energy: via the Bethe–Gibbs free energy.


We recall the mean field approximation of the free energy (equation (3.65)) and we observe that the expression of the entropy 푆퐐(푋푖) is the same as the mean field entropy 푆푖 given by (3.79).

We also note that the independent marginal 퐐(푋푖) corresponds to 푝푖 given by (3.74) and the clique potential 휙푐(Xc) corresponds to 휙푖푗(푋i, 푋푗) given by (3.73). Hence, the mean field free energy can be written as:

$F_{MF}[\mathbf{Q}] = -\sum_{X_1,\dots,X_n}\Big(\prod_{X_i\in\mathbf{X}_c}\mathbf{Q}(X_i)\Big)\cdot\ln\big(\phi_c(\mathbf{X}_c)\big) - \sum_{1\le i\le n} S_{\mathbf{Q}}(X_i)$

$F_{MF}(\{p_i\}) = -\sum_{i,j} p_i\cdot p_j\cdot\ln\phi_{ij}(X_i,X_j) + \sum_i\big(p_i\ln p_i + (1-p_i)\ln(1-p_i)\big)$   (3.83)

Then, we are going to convert the Bethe free energy given by the equations (3.81) – (3.82) into a more constrained Gibbs free energy named the Bethe–Gibbs free energy. We do this by imposing additional constraints over the Bethe free energy 𝒢훽: specifically, we minimize 𝒢훽 with respect to the parameters {푝푖푗} and then solve for {푝푖푗} exactly in terms of the parameters {푝푖}.

The minimization is done, as usual, by taking derivatives of the Bethe free energy with respect to {푝푖푗} and setting them to zero:

$\frac{\partial\mathcal{G}_\beta}{\partial p_{ij}} = -\hat{w}_{ij} + \ln\Big(\frac{p_{ij}\cdot(p_{ij}+1-p_i-p_j)}{(p_i-p_{ij})\cdot(p_j-p_{ij})}\Big) = 0$   (3.84)

This can be simplified to a quadratic equation:

$\alpha_{ij}\cdot p_{ij}^2 - (1 + \alpha_{ij}\cdot p_i + \alpha_{ij}\cdot p_j)\cdot p_{ij} + (1+\alpha_{ij})\cdot p_i\cdot p_j = 0$   (3.85)

where we have defined:

훼푖푗 = exp(푤̂푖푗) − 1 (3.86)

In addition to this equation we have to make sure that 푝푖푗 satisfies the following bounds:

max(0, 푝푖 + 푝푗 − 1) ≤ 푝푖푗 ≤ min(푝푖, 푝푗) (3.87)

These bounds can be understood by noting that the four entries of the pairwise joint table, namely $p_{ij}$, $p_i-p_{ij}$, $p_j-p_{ij}$, and $p_{ij}+1-p_i-p_j$, are probabilities and therefore cannot become negative. The following theorem guarantees the desired unique solution for $\{p_{ij}\}$.


Theorem 3.8:

There is exactly one solution to the quadratic equation (3.85) that minimizes the Bethe free energy and satisfies the bounds (3.87). The analytic expression of this solution is:

$p_{ij} = \frac{1}{2\alpha_{ij}}\cdot\Big(Q_{ij} - \sqrt{Q_{ij}^2 - 4\cdot\alpha_{ij}\cdot(1+\alpha_{ij})\cdot p_i\cdot p_j}\Big)$   (3.88)

where: $Q_{ij} = 1 + \alpha_{ij}\cdot p_i + \alpha_{ij}\cdot p_j$

Moreover, the parameters 푝푖푗 will never actually saturate one of the bounds.

The proof of this theorem is outside the scope of this paper. A proof can be found in [4].
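A direct transcription of (3.86) and (3.88) into Python reads as follows; this is a small sketch of ours, and the function name and the guard for $w_{ij}\approx 0$ are assumptions made for illustration only.

import numpy as np

def pairwise_marginal(p_i, p_j, w_ij):
    """Solve the quadratic (3.85) for p_ij via the closed form (3.88)."""
    alpha = np.exp(w_ij) - 1.0             # equation (3.86)
    if abs(alpha) < 1e-12:                 # w_ij ~ 0: the variables decouple, p_ij = p_i * p_j
        return p_i * p_j
    Q = 1.0 + alpha * p_i + alpha * p_j
    return (Q - np.sqrt(Q**2 - 4.0 * alpha * (1.0 + alpha) * p_i * p_j)) / (2.0 * alpha)

For instance, pairwise_marginal(0.5, 0.5, 0.0) returns 0.25, the value expected for two independent variables.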

Thus, by inserting the expression for {푝푖푗} given by (3.88) into the Bethe free energy given by

(3.82), we obtain the analytic expression of the Bethe–Gibbs free energy 퐺훽 (also called the Gibbs free energy in the Bethe approximation). We are not going to provide the whole analytic expression for 퐺훽[{푝푖}], but only a simpler expression that highlights the dependency of 퐺훽 on {푝푖} and how it arises:

$G_\beta[\{p_i\}] = \min_{\{p_{ij}\}}\mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = \mathcal{G}_\beta[\{p_i\},\{p_{ij}(p_i,p_j)\}]$   (3.89)

We observe that the mean field free energy 퐹푀퐹({푝푖}) given by (3.83) and the Bethe–Gibbs free energy 퐺훽[{푝푖}] given by (3.89) are similar in spirit, so they might behave similarly in approximate inference algorithms concerned with singleton marginals. In Section 3.5.4 we will elaborate upon this topic.

3.5.3 The relationship between belief propagation fixed–points and Bethe free energy

The belief propagation (BP) or sum–product algorithm is an efficient local message passing algorithm for exact inference on trees or, generally, on graphs without cycles. The BP algorithm is guaranteed to converge to the correct marginal posterior probabilities in tree–like graphical models. The BP algorithm applied to a graph with loops is called loopy belief propagation (LBP). The LBP algorithm remains well–defined and, in some cases, gives good approximate answers, while in other cases it gives poor results or fails to converge.


Yedidia, Freeman, and Weiss [28,35] established that, in a factor graph (see Definition 2.6), there is a one–to–one correspondence between the fixed–points of BP algorithms and the stationary points of the Bethe free energy. They also showed that the BP algorithms can only converge to a fixed–point that is also a stationary point of the Bethe approximation to the free energy. Their discovery, which has been heralded as a major breakthrough for belief propagation in general, not only clarified the nature of the Bethe approximation of the free energy, but also opened the way to construct more sophisticated message passing algorithms based on improvements made to Bethe’s approximation.

The theoretical result of Yedidia et al. [28,35] is applicable not only to factor graphs, but to all types of graphical models. The justification of this statement relies on two facts. The first fact is that all types of graphical models have the following property: they can be converted, before doing inference, into a pairwise Markov network, through a suitable clustering of nodes into larger nodes [54]. The second fact is that the pairwise Markov network a factor graph is converted into and the factor graph itself have the same joint probability distribution [59].

Therefore, without loss of generality, we can use a pairwise Markov network as the underlying graphical model for the BP algorithm. In this section we give only a brief description of BP. Detailed presentations of BP can be found in [59-60]. We assume that we work under the hypotheses and notations given by (3.73). We use the standard set of notations for BP algorithms and, when applicable, we provide the correspondence with our notation. We use the notation 푑푖 for the degree of node 푖.

The standard BP update rules are applicable to the message that node 푖 sends to node 푗 denoted 푚푖푗 and to the belief of node 푖 denoted 푏푖:

$m_{ij}(X_j) \leftarrow \alpha\cdot\sum_{X_i}\phi_{ij}(X_i,X_j)\cdot\phi_i(X_i)\cdot\prod_{k\in \mathrm{ne}(i)-\{j\}} m_{ki}(X_i)$   (3.90)

$b_i(X_i) \leftarrow \alpha\cdot\phi_i(X_i)\cdot\prod_{k\in \mathrm{ne}(i)} m_{ki}(X_i)$   (3.91)

where $\alpha$ denotes a normalization constant and $\mathrm{ne}(i)$ denotes the Markov blanket of node $i$.

The belief 푏푖(푋푖) is obtained by multiplying all incoming messages to node 푖 by the local evidence. If the belief 푏푖(푋푖) is normalized, then it approximates the marginal probability

푝푖 = 푝푖(푋푖) given by (3.74).


The belief 푏푖푗(푋i, 푋푗) at the pair of connected nodes 푋i and 푋푗 is defined as the product of the local potentials and all incoming messages to the pair of nodes:

$b_{ij}(X_i,X_j) \leftarrow \alpha\cdot\psi_{ij}(X_i,X_j)\cdot\prod_{k\in\mathrm{ne}(i)-\{j\}} m_{ki}(X_i)\cdot\prod_{l\in\mathrm{ne}(j)-\{i\}} m_{lj}(X_j)$   (3.92)

where: $\psi_{ij}(X_i,X_j) \equiv \phi_{ij}(X_i,X_j)\cdot\phi_i(X_i)\cdot\phi_j(X_j)$   (3.93)

If the belief 푏푖푗(푋i, 푋푗) is normalized, then it approximates the marginal probability 푝푖푗 = 푝푖푗(푋푖푋푗) given by (3.74). Generally the beliefs 푏푖(푋푖) and 푏푖푗(푋i, 푋푗) are approximate marginals. However, they become the exact marginals when the underlying graph contains no cycles [13].
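For a binary pairwise network, the update rules (3.90) and (3.91) can be iterated directly, also on graphs with loops (the LBP case). The Python sketch below is our own illustration of these updates under the notation of (3.73); it is not code from the cited references. It passes messages until they stop changing and then forms the singleton beliefs.

import numpy as np

def loopy_bp(phi, Phi, edges, n, max_iters=100, tol=1e-6):
    """Loopy belief propagation on a binary pairwise Markov network.

    phi   : (n, 2) array, phi[i, x] = node potential phi_i(X_i = x)
    Phi   : dict {(i, j): (2, 2) array} edge potentials phi_ij(X_i, X_j), keys with i < j
    edges : list of undirected edges (i, j) with i < j
    Returns b, an (n, 2) array of singleton beliefs b_i(X_i).
    """
    ne = {i: [] for i in range(n)}
    for (i, j) in edges:
        ne[i].append(j)
        ne[j].append(i)
    # m[(i, j)] is the message from node i to node j, a length-2 vector over X_j
    m = {}
    for (i, j) in edges:
        m[(i, j)] = np.ones(2) / 2
        m[(j, i)] = np.ones(2) / 2

    def edge_pot(i, j):
        # phi_ij(X_i, X_j) with the first index always belonging to node i
        return Phi[(i, j)] if (i, j) in Phi else Phi[(j, i)].T

    for _ in range(max_iters):
        max_change = 0.0
        for (i, j) in list(m.keys()):
            # product of local evidence and incoming messages to i, excluding j (rule 3.90)
            prod = phi[i].astype(float).copy()
            for k in ne[i]:
                if k != j:
                    prod = prod * m[(k, i)]
            new_m = edge_pot(i, j).T @ prod          # sum over X_i
            new_m = new_m / new_m.sum()              # normalization constant alpha
            max_change = max(max_change, np.abs(new_m - m[(i, j)]).max())
            m[(i, j)] = new_m
        if max_change < tol:
            break

    b = phi.astype(float).copy()
    for i in range(n):                               # beliefs, rule (3.91)
        for k in ne[i]:
            b[i] = b[i] * m[(k, i)]
        b[i] = b[i] / b[i].sum()
    return b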

Theorem 3.9 (Yedidia et al. [28,35]):

Let {푚푖푗} be a set of BP messages and let {푏푖푗, 푏푖} be the beliefs calculated from those messages. Then the beliefs are fixed–points of the BP algorithm if and only if they are zero–gradient points of the Bethe free energy 𝒢훽, subject to the following normalization and marginalization constraints:

$\sum_{X_i} b_i(X_i) = 1 \quad\text{and}\quad \sum_{X_i} b_{ij}(X_i,X_j) = b_j(X_j)$   (3.94)

Proof: We start by writing the Bethe free energy 𝒢훽 (3.82) in terms of beliefs:

$\mathcal{G}_\beta(\{b_i\},\{b_{ij}\}) = \sum_{i,j}\sum_{X_i,X_j} b_{ij}(X_i,X_j)\cdot[\ln b_{ij}(X_i,X_j) - \ln\psi_{ij}(X_i,X_j)]$
$\qquad\qquad - \sum_i (d_i-1)\cdot\sum_{X_i} b_i(X_i)\cdot[\ln b_i(X_i) - \ln\phi_i(X_i)]$   (3.95)

To prove the claim " → " we add the following Lagrange multipliers to form a Lagrangian 퐿:

 휆푖푗(푋푗) is the multiplier corresponding to the constraint that 푏푖푗(푋i, 푋푗) marginalizes down to

푏푗(푋푗);

 휉푖푗 , 휉푖 are multipliers corresponding to the normalization constraints.

The equation $\frac{\partial L}{\partial b_{ij}(X_i,X_j)} = 0$ gives:

$\ln b_{ij}(X_i,X_j) = \ln\psi_{ij}(X_i,X_j) + \lambda_{ij}(X_j) + \lambda_{ji}(X_i) + \xi_{ij} - 1$


The equation $\frac{\partial L}{\partial b_i(X_i)} = 0$ gives:

$(d_i-1)\cdot(\ln b_i(X_i) + 1) = \ln\phi_i(X_i) + \sum_{j\in\mathrm{ne}(i)}\lambda_{ji}(X_i) + \xi_i$

Setting:

$\lambda_{ij}(X_j) = \ln\prod_{k\in\mathrm{ne}(j)-\{i\}} m_{kj}(X_j)$

and using the marginalization constraints (3.94), we find that the stationarity conditions on the Lagrangian are equivalent to the BP fixed–point conditions.

To prove the claim " ← ", consider that we are given 푏푖, 푏푖푗, and 휆푖푗(푋푗) that correspond to a zero–gradient point and set:

$m_{ij}(X_j) = \frac{b_j(X_j)}{\exp(\lambda_{ij}(X_j))}$

Because 푏푖, 푏푖푗, and 휆푖푗(푋푗) satisfy the stationarity conditions, 푚푖푗 defined in this way must be a fixed–point of BP.

Since both sides of Theorem 3.9 are valid, there is a one–to–one correspondence between the fixed–points of the BP algorithm and the stationary points of the Bethe free energy.

The consequences of Theorem 3.9 are different for tree–like graphs and for loopy graphs. In tree–like graphs the fixed–points of the BP algorithm are the global minima of the Bethe free energy [28]. This is a natural consequence of the fact that the BP algorithm performs exact inference in graphs without loops, so the Bethe free energy is minimal for exact marginals.

In loopy graphs the situation is more complicated. In [61] Heskes showed that the stable fixed– points of the LBP algorithm are local minima of the Bethe free energy. He also showed that the converse is not necessarily true: minima of the Bethe free energy can be unstable fixed–points of LBP [61]. Furthermore, in [62] Heskes derived sufficient conditions for uniqueness of a LBP fixed–point. By using a particular Boltzmann machine as a counter–example, Heskes showed that the uniqueness of a LBP fixed–point does not guarantee the convergence of the LBP algorithm to that fixed–point [62].


In [58] Shin proposed an alternative to LBP that fixes its convergence issue via double–looping. His solution applies to arbitrary binary graphical models with 푛 nodes and maximum degree O(log 푛) in the underlying graph. Shin’s algorithm is a message passing algorithm that solves the Bethe fixed–point equations in a polynomial number of bitwise operations and is considered the first fully polynomial–time approximation scheme for the LBP fixed–point computation in Markov random fields [58].

We end this section by rewriting the expression (3.83) of the mean field free energy in terms of beliefs:

$F_{MF}(\{b_i\}) = -\sum_{i,j} b_i(X_i)\cdot b_j(X_j)\cdot\ln\phi_{ij}(X_i,X_j) + \sum_i b_i(X_i)\cdot[\ln b_i(X_i) - \ln\phi_i(X_i)]$   (3.96)

3.5.4 Belief optimization

Unlike the mean field free energy and the Bethe–Gibbs free energy, which include only first order terms 푝푖(푋푖), the Bethe free energy includes first–order terms 푝푖(푋푖) as well as second– order terms 푝푖푗(푋i, 푋푗). Unlike the mean field free energy, which is minimized in the primal variables {푝푖}, the Bethe free energy can be minimized in both the primal space and the dual space. Usually the Bethe free energy is minimized in the dual space by using messages, which are a combination of the dual variables {휆푖푗(푋푗)}. The process of minimizing the Bethe free energy in the primal space, i.e., in terms of the posterior probability distributions, is similar to the mean field free energy minimization. The approximate inference algorithm that employs this type of minimization for the Bethe free energy is named belief optimization and represents an alternative to the fixed–point equations of belief propagation [4].

In order to derive the fixed–point equations that solve for the marginals {푝푖} of the Bethe free energy, we follow a familiar recipe: first we compute the derivatives of the Bethe–Gibbs free energy given by (3.89) with respect to {푝푖} and then we equate them to zero:

$G_\beta[\{p_i\}] = \min_{\{p_{ij}\}}\mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = \mathcal{G}_\beta[\{p_i\},\{p_{ij}(p_i,p_j)\}]$

$\frac{dG_\beta}{dp_i} = \frac{\partial\mathcal{G}_\beta}{\partial p_i} + \sum_{j\in\mathrm{ne}(i)}\frac{\partial\mathcal{G}_\beta}{\partial p_{ij}}\cdot\frac{\partial p_{ij}}{\partial p_i}$   (3.97)

where $\mathrm{ne}(i)$ denotes the Markov blanket of unit $i$.


We recall that in $G_\beta[\{p_i\}]$ the pairwise marginals $\{p_{ij}\}$ are defined in terms of the singleton marginals $\{p_i\}$ precisely so that $\frac{\partial\mathcal{G}_\beta}{\partial p_{ij}} = 0$. Therefore, (3.97) becomes:

$\frac{dG_\beta}{dp_i} = \frac{\partial\mathcal{G}_\beta}{\partial p_i}$   (3.98)

The equation (3.98) shows that, under the current assumptions, the Bethe–Gibbs free energy

퐺훽 and the Bethe free energy 𝒢훽 have the same fixed–points. In order to solve the gradient, we use the analytic expression (3.82) of 𝒢훽.

$\frac{\partial\mathcal{G}_\beta}{\partial p_i} = \theta_i + \ln\Big(\frac{(1-p_i)^{d_i-1}\cdot\prod_{j\in\mathrm{ne}(i)}(p_i-p_{ij})}{p_i^{\,d_i-1}\cdot\prod_{j\in\mathrm{ne}(i)}(p_{ij}+1-p_i-p_j)}\Big)$   (3.99)

Equating these derivatives to zero gives the following set of fixed–point equations for the Bethe–Gibbs free energy $G_\beta$ and, equally, for the Bethe free energy [4]:

$p_i = \mathrm{sigm}\Big(-\theta_i + \sum_{j\in\mathrm{ne}(i)}\ln\Big(\frac{p_i\cdot(p_{ij}+1-p_i-p_j)}{(1-p_i)\cdot(p_i-p_{ij})}\Big)\Big)$ for all $i\in V$   (3.100)

Regardless of how they are run, sequentially or in parallel, the fixed–point equations (3.100) are not guaranteed to decrease the Bethe free energy 𝒢훽, or even to converge at all.

Similarly to the mean field approximation, we may achieve a decrease of the Bethe free energy by optimizing it relative to only a single coordinate in the space, i.e., by temporarily fixing all neighboring marginals 푝푗 and minimizing over the central node 푝푖. The resulting value for the singleton marginal 푝푖 is optimal given the choice of all other singleton marginals. Furthermore, to minimize the function in its entirety, we need to minimize relative to all the coordinates {푝푖}. One way to achieve this goal is to embed the single–coordinate minimization step in an iterated coordinate descent algorithm, which repeatedly minimizes a single marginal at a time, given fixed choices for all of the others.

Another way to perform the global minimization is to operate on all the coordinates {푝푖} simultaneously while enforcing the constraint that they stay within the interval [0,1].
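A minimal Python sketch of such a coordinate scheme, assuming the binary pairwise conventions above and reusing the closed form (3.88) for the pairwise marginals, could look as follows. The names, the clipping safeguard, and the small epsilon inside the logarithms are ours; a production implementation would add damping or a line search to actually guarantee descent of the Bethe free energy.

import numpy as np

def belief_optimization(W, theta, edges, sweeps=100):
    """Coordinate iteration of the fixed-point equations (3.100).

    W     : dict {(i, j): w_ij} edge weights, keys with i < j
    theta : (n,) thresholds
    edges : list of edges (i, j) with i < j
    """
    def p_pair(pi, pj, w):                 # closed form (3.88) for p_ij
        a = np.exp(w) - 1.0
        if abs(a) < 1e-12:
            return pi * pj
        Q = 1.0 + a * pi + a * pj
        return (Q - np.sqrt(Q**2 - 4.0 * a * (1.0 + a) * pi * pj)) / (2.0 * a)

    eps = 1e-12                            # numerical guard for the logarithms
    n = len(theta)
    ne = {i: [] for i in range(n)}
    for (i, j) in edges:
        ne[i].append(j)
        ne[j].append(i)
    p = np.full(n, 0.5)                    # initial singleton marginals
    for _ in range(sweeps):
        for i in range(n):
            s = 0.0
            for j in ne[i]:
                w = W[(i, j)] if (i, j) in W else W[(j, i)]
                pij = p_pair(p[i], p[j], w)
                s += np.log(max(p[i] * (pij + 1.0 - p[i] - p[j]), eps)) \
                     - np.log(max((1.0 - p[i]) * (p[i] - pij), eps))
            # equation (3.100); clipping keeps the marginal strictly inside (0, 1)
            p[i] = np.clip(1.0 / (1.0 + np.exp(theta[i] - s)), 1e-6, 1 - 1e-6)
    return p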


Chapter 4. Introduction to Boltzmann Machines

A Boltzmann machine is a parallel computational organization or network that is well suited to constraint satisfaction tasks involving large numbers of “weak” constraints. A weak constraint is a goal criterion that may not be satisfied by any solution; in other words, it is not an all–or–none criterion. In some problem domains, such as finding the most plausible interpretation of an image, it happens frequently that even the best possible solution violates some constraints. In these cases a variation of weak constraints is used, specifically weak constraints that incur a cost when violated. Furthermore, the quality of the solution is determined by the total cost of all the constraints that it violates [10-11,43].

4.1 Definitions

Structurally a Boltzmann machine is a symmetrical connectionist network with hidden units; therefore, its structure follows the general structure of a connectionist network described in Section 2.4. Hinton characterized the Boltzmann machine as “a generalization of a Hopfield network in which the units update their states according to a stochastic decision rule” [51]. The majority of the following definitions are taken from [1,3,43] and appropriately adapted for consistency. Our focus in Chapter 4 and Chapter 5 is the asynchronous Boltzmann machine, which we simply refer to as the Boltzmann machine.

Definition 4.1:

A Boltzmann machine is a neural network that satisfies certain properties. Formally a Boltzmann machine 퐁퐌 is a four–tuple:

퐁퐌 = (퓝, 퓖, 퐖̂ , 횯) (4.1) comprising:

 a finite set 퓝 of primitive computing elements called units or neurons; Without restricting the generality we assume that 𝒩 is indexed by the set {1,2, … 푛} for 풏 = |𝒩|. To make the formulae more readable, in subsequent development we refer equivalently to 𝒩 and {1,2, … 푛}.


 an undirected graph (퓝, 퓖) called the connectivity graph, where:

퓖 = { {푖, 푗} ∶ (푖, 푗) ∈ 𝒩 × 𝒩, 푖 ≠ 푗} (4.2)

 a collection 퐖̂ = (푤̂푖푗){푖,푗}∈퓖 of real numbers called the weights or synaptic weights; each

weight 푤̂푖푗 is associated with an edge {푖, 푗};

 a collection 횯 = (휃푗)푗∈𝒩 of real numbers called the thresholds; each threshold 휃푗 is associated with a unit; and satisfying the following properties:

 a unit is always in one of two activation levels or states designated as on/off or 1/0 or 1/-1;  a unit adopts these activation levels as a probabilistic function of the activation levels of its neighboring units and the weights on its edges to them;

 the weights 푤̂푖푗 on the edges are symmetric having the same strength in both directions:

푤̂푖푗 = 푤̂푗푖 (4.3)

 the weights 푤̂푖푗 on the edges can take on real values of either sign;  a unit being on or off is taken to mean that the system currently accepts or rejects some elemental hypothesis about the domain;  the weight on an edge represents a weak pairwise constraint between two hypotheses:

 a positive weight indicates that the two hypotheses tend to support one another; if one is currently accepted, accepting the other should be more likely;  a negative weight suggests, other things being equal, that the two hypotheses should not both be accepted.

The following notions are important in subsequent development:

 The terms link and connection equally denote an edge {푖, 푗} ∈ 퓖 of the connectivity graph, where 1 ≤ 푖, 푗 ≤ 푛;  퐈 = {0,1} denotes the set of possible activation levels or states for a unit;

Hopfield represented the states of his model with -1 and 1 because his model was derived from a physical system called spin glass in which spins are either down or up. Provided the units have thresholds, models that use the representation -1 and 1 for their states can be translated into models that represent their states with 0 and 1 and have different thresholds


[43]. In Section 2.5 we showed a similar translation for a Hopfield network (equations (2.33) and (2.34)).

 𝝈퐢 ∈ I denotes the activation level of unit 푖, ∀푖 ∈ 𝒩;  ℝ퓖 denotes the set of all families of weights Ŵ ;  ℝ퓝 denotes the set of all families of thresholds Θ;  The connectivity graph (𝒩, 𝒢) can be extended to (퓝, 퓖′) as follows:

$\mathcal{G}' = \{\{i,j\} : \{i,j\}\in\mathcal{G} \text{ or } (i = 0 \text{ and } j\in\mathcal{N})\}$   (4.4)

and:

$\mathbb{R}^{\mathcal{G}'} \stackrel{\text{def}}{=} \mathbb{R}^{\mathcal{G}}\times\mathbb{R}^{\mathcal{N}}$   (4.5)

 Parameters

The parameters or extended weights $\mathbf{W}$ are a collection of real numbers defined as:

$\mathbf{W} \stackrel{\text{def}}{=} (\hat{\mathbf{W}}, \mathbf{\Theta}) = (w_{ij})_{\{i,j\}\in\mathcal{G}'} \in \mathbb{R}^{\mathcal{G}'}$   (4.6)

where:

$w_{ij} \stackrel{\text{def}}{=} \hat{w}_{ij}$ if $\{i,j\}\in\mathcal{G}$, and $w_{ij} \stackrel{\text{def}}{=} -\theta_j$ if $\{i,j\}\in\mathcal{G}'-\mathcal{G}$   (4.7)

The number of elements of Ŵ is at most $\frac{n\cdot(n-1)}{2}$, which corresponds to (𝒩, 𝒢) being a complete undirected graph with 푛 vertices or units. The number of elements of W is at most $\frac{n\cdot(n-1)}{2} + n = \frac{n\cdot(n+1)}{2}$, which corresponds to (𝒩, 𝒢′) being a complete undirected graph with 푛 + 1 vertices or units.

Definition 4.2:

If we incorporate into Definition 4.1 the concept of parameters according to the formulae (4.6) and (4.7), then we obtain the following equivalent definition for the Boltzmann machine:

퐁퐌 = (퓝, 퓖, (퐖̂ , 횯)) = (퓝, 퓖′, 퐖) (4.8)

 Configurations

A function 𝝈: 퓝 ⟶ 퐈, 𝝈(풊) ≝ 𝝈퐢 is called an 퐼–valued configuration of 𝒩.

A specification of activation levels (𝝈퐢)풊∈퓝 of all the units 푖 ∈ 𝒩 represents a configuration or a global state of 퐁퐌. A configuration of 퐁퐌 can also be seen as a particular combination of


hypotheses about the domain. The set of all possible configurations of 퐁퐌 represents the configuration space of 퐁퐌 and is written 퐈퓝.

 Net input of configuration towards unit

Generally, the net input of configuration 휎 towards a unit 푖 ∈ 𝒩, also called the activation potential of unit 푖, is defined by the equation (2.23). If we adapt the equation (2.23) to the current conventions and notations, we obtain:

$\mathrm{net}_i \equiv \mathrm{net}(i,\sigma) = -\theta_i + \sum_{j\in\mathcal{G}(i)}\hat{w}_{ji}\cdot\sigma_j = \sum_{j\in\mathcal{G}'(i)} w_{ji}\cdot\sigma_j$   (4.9)

where: $\mathcal{G}(i) = \{j : \{j,i\}\in\mathcal{G}\}$; $\mathcal{G}'(i) = \{j : j\in\{0\}\cup\mathcal{N} \text{ and } \{j,i\}\in\mathcal{G}'\}$; and $\sigma_j$ is the projection of $\sigma$ onto the jth component of $\mathbf{I}^{\mathcal{N}}$. The net input of configuration $\sigma$ towards a specific unit can be seen as a mapping between a given configuration with a given unit and $\mathbb{R}$. The mapping between the configuration alone and $\mathbb{R}$ is called a Hamiltonian. Formally, a Hamiltonian H is an element of $\mathcal{H}(\mathcal{N})$, where $\mathcal{H}(\mathcal{N})$ represents the set of all real–valued functions defined on $\mathbf{I}^{\mathcal{N}}$:

퓗(퓝) = {H | H ∶ I𝒩 ⟶ ℝ} (4.10)

Clearly $\mathcal{H}(\mathcal{N})$ is a linear space of dimension $2^n$.

 Probability functions on the configuration space

Let $\mathcal{P}(\mathcal{N})$ denote the set of all probability functions on the configuration space $\mathbf{I}^{\mathcal{N}}$. $\mathcal{P}(\mathcal{N})$ is a simplex of dimension $2^n - 1$ in $\mathbb{R}^{\mathbf{I}^{\mathcal{N}}}$:

$\mathcal{P}(\mathcal{N}) = \Big\{\mathbf{P} \ \Big|\ \mathbf{P} : \mathbf{I}^{\mathcal{N}} \to [0,1],\ \sum_{\sigma\in\mathbf{I}^{\mathcal{N}}}\mathbf{P}(\sigma) = 1\Big\}$   (4.11)

Let 퓟+(퓝) denote the interior of the simplex 풫(𝒩), i.e., the set of those 퐏 ∈ 풫(𝒩) that are nondegenerate, in the sense that 퐏(휎) ≠ 0 for all 휎 ∈ I𝒩:

$\mathcal{P}_+(\mathcal{N}) = \{\mathbf{P}\in\mathcal{P}(\mathcal{N}) \mid \mathbf{P}(\sigma)\neq 0 \text{ for all } \sigma\in\mathbf{I}^{\mathcal{N}}\}$   (4.12)

 Gibbs measure associated to a Hamiltonian

Any element H ∈ ℋ(𝒩) gives rise to a probability distribution 퐆퐇 ∈ 퓟+(퓝) named the Gibbs measure associated to the Hamiltonian H and defined by:

$\mathbf{G}_{\mathrm{H}}(\sigma) = \frac{\exp(\mathrm{H}(\sigma))}{Z(\mathrm{H})}$   (4.13)


where 푍(H) is the normalization constant needed to make the probabilities add up to 1, i.e.:

$Z(\mathrm{H}) = \sum_{\sigma\in\mathbf{I}^{\mathcal{N}}}\exp(\mathrm{H}(\sigma))$   (4.14)

Clearly the set of Gibbs measures on $\mathbf{I}^{\mathcal{N}}$ is exactly $\mathcal{P}_+(\mathcal{N})$.

Two Hamiltonians H1 ≠ H2 give rise to the same Gibbs measure if and only if they differ by a

constant. Let 퓗ퟎ(퓝) be the quotient space of ℋ(𝒩) modulo the constants. Then the

function 풇ퟎ defined by the equation (4.15) is well defined and bijective:

풇ퟎ ∶ 퓗ퟎ(퓝) ⟶ 퓟+(퓝), 풇ퟎ(퐇) = 퐆퐇 (4.15)

 Quadratic Hamiltonian

For any $i\in\mathcal{N}$, let $\sigma_i$ denote the projection of $\sigma$ onto the ith component of $\mathbf{I}^{\mathcal{N}}$.

Let also $\sigma_0$ denote the function identically equal to 1: $\sigma_0 \stackrel{\text{def}}{=} 1$. Then to any pair $(\hat{\mathbf{W}},\mathbf{\Theta}) = \mathbf{W}\in\mathbb{R}^{\mathcal{G}'}$ we associate a function $\mathrm{H}_{(\hat{\mathbf{W}},\mathbf{\Theta})} \stackrel{\text{def}}{=} \mathrm{H}_{\mathbf{W}}$ named the Hamiltonian of $\mathbf{W} = (\hat{\mathbf{W}},\mathbf{\Theta})$ and defined by:

$\mathrm{H}_{(\hat{\mathbf{W}},\mathbf{\Theta})} : \mathbf{I}^{\mathcal{N}} \to \mathbb{R}, \quad \mathrm{H}_{(\hat{\mathbf{W}},\mathbf{\Theta})}(\sigma) = \sum_{\{i,j\}\in\mathcal{G},\, i<j}\sigma_i\cdot\sigma_j\cdot\hat{w}_{ij} - \sum_{j\in\mathcal{N}}\sigma_j\cdot\theta_j$   (4.16)

equivalent to:

$\mathrm{H}_{(\hat{\mathbf{W}},\mathbf{\Theta})}(\sigma) = \frac{1}{2}\cdot\sum_{\{i,j\}\in\mathcal{G}}\sigma_i\cdot\sigma_j\cdot\hat{w}_{ij} - \sum_{j\in\mathcal{N}}\sigma_j\cdot\theta_j$   (4.17)

equivalent to:

$\mathrm{H}_{\mathbf{W}} : \mathbf{I}^{\mathcal{N}} \to \mathbb{R}, \quad \mathrm{H}_{\mathbf{W}}(\sigma) = \sum_{\{i,j\}\in\mathcal{G}',\, i<j}\sigma_i\cdot\sigma_j\cdot w_{ij}$   (4.18)

Any Hamiltonian H which is of the form $\mathrm{H}_{\mathbf{W}}$ for some $\mathbf{W}\in\mathbb{R}^{\mathcal{G}'}$ is called a quadratic Hamiltonian.

 Partition function and cumulant function

The partition function associated to a Hamiltonian H is the function 풁 defined by:

$Z : \mathcal{H}(\mathcal{N}) \to \mathbb{R}, \quad Z(\mathrm{H}) = \sum_{\sigma\in\mathbf{I}^{\mathcal{N}}}\exp(\mathrm{H}(\sigma))$   (4.19)

The partition function of a quadratic Hamiltonian HW is denoted by the simplified notation:

$Z(\mathbf{W}) \stackrel{\text{def}}{=} Z(\mathrm{H}_{\mathbf{W}}) = \sum_{\sigma\in\mathbf{I}^{\mathcal{N}}}\exp(\mathrm{H}_{\mathbf{W}}(\sigma))$   (4.20)


The partition function 푍 of a quadratic Hamiltonian HW is well defined and strictly convex on

ℋ0(𝒩) and, generally, intractable. Therefore, we need to approximate it and this is where the cumulant function helps.

By definition, the cumulant function 푨 of a quadratic Hamiltonian HW is the natural logarithm of the partition function associated to that Hamiltonian:

푨 ∶ 퓗(퓝) ⟶ ℝ, 푨(퐖) ≝ 푨(퐇퐖) = ln 푍(W) (4.21)

When trying to approximate a probability distribution, it is more important to get the probabilities correct for events that happen frequently than for rare events. One way to accomplish this objective is to operate with logarithms of probabilities instead of directly with probabilities.

 Quadratic Gibbs measure

We introduce a simplified notation for the Gibbs measure $\mathbf{G}_{\mathrm{H}_{\mathbf{W}}}$ associated to a Hamiltonian $\mathrm{H}_{\mathbf{W}}$, which itself is associated to a given set of parameters $\mathbf{W} = (\hat{\mathbf{W}},\mathbf{\Theta})\in\mathbb{R}^{\mathcal{G}'}$:

$\mathbf{G}_{\mathbf{W}}(\sigma) \stackrel{\text{def}}{=} \mathbf{G}_{\mathrm{H}_{\mathbf{W}}}(\sigma) = \frac{\exp(\mathrm{H}_{\mathbf{W}}(\sigma))}{Z(\mathrm{H}_{\mathbf{W}})} = \frac{\exp(\mathrm{H}_{\mathbf{W}}(\sigma))}{Z(\mathbf{W})}$ for all $\sigma\in\mathbf{I}^{\mathcal{N}}$   (4.22)

A quadratic Gibbs measure on I𝒩 associated to a connectivity graph (𝒩, 𝒢) is a probability function 퐏 ∈ 풫(𝒩) that satisfies the following property:

$\exists\,\mathbf{W} = (\hat{\mathbf{W}},\mathbf{\Theta})\in\mathbb{R}^{\mathcal{G}'}$ such that $\mathbf{P}\equiv\mathbf{G}_{\mathbf{W}}$

We introduce the notation 퐆ퟐ(퓝, 퓖) to designate the set of all quadratic Gibbs measures on I𝒩:

$\mathbf{G}_2(\mathcal{N},\mathcal{G}) = \{\mathbf{P}\in\mathcal{P}(\mathcal{N}) : \exists\,\mathbf{W} = (\hat{\mathbf{W}},\mathbf{\Theta})\in\mathbb{R}^{\mathcal{G}'} \text{ such that } \mathbf{P}\equiv\mathbf{G}_{\mathbf{W}}\}$   (4.23)

Clearly, quadratic Gibbs measures on $\mathbf{I}^{\mathcal{N}}$ are very special, since they are parameterized by $\mathbf{W}\in\mathbb{R}^{\mathcal{G}'}$, i.e., by at most $\frac{n\cdot(n+1)}{2}$ parameters, whereas $\mathcal{P}(\mathcal{N})$ has dimension $2^n - 1$ and $\frac{n\cdot(n+1)}{2} \ll 2^n - 1$. If we consider, in addition, probability distributions that are marginals of quadratic Gibbs measures, we get all the Gibbs measures on $\mathbf{I}^{\mathcal{N}}$.


4.2 Modelling the underlying structure of an environment

By differentiating their roles in the learning process, Hinton partitioned the units of a Boltzmann machine into two functional groups: a nonempty set of visible units and a possibly empty set of hidden units. This is how Hinton explained in [10] the reason for this partition:

Suppose that the environment directly and completely determines the states of a subset of the units (called the "visible" units), but leaves the network to determine the states of the remaining "hidden" units. The aim of the learning is to use the hidden units to create a model of the structure implicit in the ensemble of binary state vectors that the environment determines on the visible units.

A more detailed justification for the differentiation between units is given by Hinton in [10]. He considers a parallel network like the Boltzmann machine a “pattern completion device such that a subset of the units are “clamped” into their on or off state and the weights in the network then complete the pattern by determining the states of the remaining units” [10]. Hinton comes up with an example of a network that has one unit for each component of the environmental input vector; such a network is capable of learning only a limited set of binary vectors. He uses this example to explain why his assumption about pattern completion has strong limitations and how these limits can be transcended: by using extra units whose states do not correspond to components in the vectors to be learned [43]:

The weights on connections to these extra units can be used to represent complex interactions that cannot be expressed as pairwise correlations between the components of the environmental input vectors.

He calls these extra units hidden units and the units that are used to specify the patterns visible units. In [43] Hinton gives the following intuitive explanation for the separation of units in two classes:

The visible units are the interface between the network and the environment that specifies vectors for it to learn or asks it to complete a partial vector. The hidden units are where the network can build its own internal representations.

Formally this split–operation of 𝒩 can be described as:

퓝 = 퓥 ∪ 퓗 and 퓥 ∩ 퓗 = ∅ (4.24) where 퓥 represents the set of visible units and 퓗 represents the set of hidden units.

Let 풎 be the number of units in 풱 and 풍 the number of units in ℋ:


푛 = 푚 + 푙, 푚 = |풱|, 푙 = | ℋ| (4.25)

Theoretically, the structure of an environment can be specified by giving the probability distribution over all $2^m$ states of the visible units 풱. Practically, the network is said to have a perfect model of the environment if it achieves exactly the same probability distribution over these $2^m$ states when it is running freely at thermal equilibrium with all units unclamped so there is no environmental input [10].

We can regard I𝒩 as the Cartesian product of I풱 and Iℋ and each configuration 휎 ∈ I𝒩 as a pair of configurations over the visible and the hidden units, respectively:

퐈퓝 = 퐈퓥 × 퐈퓗 (4.26)

𝝈 = (풗, 풉) for 푣 ∈ I풱 and ℎ ∈ Iℋ (4.27)

If 퐏 ∈ 풫(𝒩), we use 퐌퐀퐑퐆(퐏, 풱) to denote the marginal of the probability distribution 퐏 with respect to the variables 휎푖 such that 휎푖 = 푣푖 for all 푖 ∈ 풱, i.e., the measure given by:

$\mathrm{MARG}(\mathbf{P},\mathcal{V}) \equiv \mathrm{MARG}(\mathbf{P},\mathcal{V})(v) = \sum_{h\in\mathbf{I}^{\mathcal{H}}}\mathbf{P}(v,h)$ for $v\in\mathbf{I}^{\mathcal{V}}$   (4.28)

Given a connectivity graph (𝒩, 𝒢), we introduce the notation 퐆ퟐ(퓥, 퓗, 퓖) to designate the set of all probability measures 퐐 on 풱 that are marginals of some quadratic Gibbs measure 퐏 ∈

$\mathbf{G}_2(\mathcal{N},\mathcal{G})$, i.e., satisfy $\mathbf{Q}\equiv\mathrm{MARG}(\mathbf{P},\mathcal{V})$ for some $\mathbf{P}\in\mathbf{G}_2(\mathcal{N},\mathcal{G})$. We also recall the definition (4.23) of $\mathbf{G}_2(\mathcal{N},\mathcal{G})$, i.e., the set of all quadratic Gibbs measures on $\mathbf{I}^{\mathcal{N}}$.

$\mathbf{G}_2(\mathcal{N},\mathcal{G}) = \{\mathbf{P}\in\mathcal{P}(\mathcal{N}) : \exists\,\mathbf{W}\in\mathbb{R}^{\mathcal{G}'} \text{ such that } \mathbf{P}\equiv\mathbf{G}_{\mathbf{W}}\}$   (4.29)
$\mathbf{G}_2(\mathcal{V},\mathcal{H},\mathcal{G}) = \{\mathbf{Q}\in\mathcal{P}(\mathcal{V}) : \exists\,\mathbf{P}\in\mathbf{G}_2(\mathcal{N},\mathcal{G}) \text{ such that } \mathbf{Q}\equiv\mathrm{MARG}(\mathbf{P},\mathcal{V})\}$

The following theorem mentioned in [3] establishes a relation between 풫+(풱) and G2(풱, ℋ, 𝒢).

Theorem 4.1:

Let $(\mathcal{N},\mathcal{G})$ be the full connectivity graph of a $\mathbf{BM}$. Using the notations (4.25), let us assume that:

$l \ \ge\ 2^m - \frac{1}{2}\cdot(m^2 + m) - 1$   (4.30)

Then:

$\mathbf{G}_2(\mathcal{V},\mathcal{H},\mathcal{G}) = \mathcal{P}_+(\mathcal{V})$

This means that every nondegenerate probability distribution on I풱 can be realized as a marginal of a distribution on I𝒩 with a quadratic Hamiltonian.
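For example, for $m = 3$ visible units the bound (4.30) requires only $l \ge 2^3 - \frac{1}{2}(3^2+3) - 1 = 8 - 6 - 1 = 1$ hidden unit, while for $m = 10$ it already requires $l \ge 1024 - 55 - 1 = 968$ hidden units; the number of hidden units needed by this construction therefore grows exponentially with the number of visible units.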


In view of this result, if we are trying to model a probability distribution 퐐 on I풱, it is not too much of a restriction to assume that 퐐 is a marginal of some quadratic Gibbs measure on some larger set 𝒩 = 풱 ∪ ℋ of units. In particular, if we only look at the visible units, then the equilibrium behavior of a Boltzmann machine is a marginal of a quadratic Gibbs measure. Moreover, every quadratic Gibbs measure arises from a Boltzmann machine and then Theorem 4.1 implies that every nondegenerate probability measure on I풱 arises as the behavior of some Boltzmann machine, possibly with hidden units.

The connectivity graph (𝒩, 𝒢) of a Boltzmann machine proposed by Hinton in [8-11,43] is a general undirected graph, which means that, as a graphical model, a general Boltzmann machine represents a pairwise Markov random field. Its connectivity graph can be any undirected graph, in particular a complete undirected graph that was described by Hinton as a fully–connected Boltzmann machine. However, the majority of research on Boltzmann machines has been done on a particular type of graph, specifically a graph that can be “decomposed” into layers. This particular graph structure is named in the machine learning literature the generic Boltzmann machine or simply the Boltzmann machine. Because the topic of this paper – learning algorithms for Boltzmann machines – originates in the field of machine learning, we adhere to their concept of Boltzmann machine.

The generic Boltzmann machine has one layer which contains all the fully–interconnected visible units and at least one layer that contains fully–interconnected hidden units. If the hidden units are distributed among multiple layers, only one hidden layer is connected, specifically fully–connected, with the visible layer. The other hidden layers are interconnected; that is, each hidden layer is fully connected with the layer “below” and fully connected with the layer “above” (if that exists). The visible layer is placed “below” the first hidden layer. Figure 1 illustrates two Boltzmann machine configurations: the fully–connected Boltzmann machine and the generic Boltzmann machine.


Figure 1 a) A fully–connected Boltzmann machine with three visible nodes and four hidden nodes; b) A layered Boltzmann machine with one visible layer and two hidden layers.

4.3 Representation of a Boltzmann Machine as an energy– based model

As constraint satisfaction networks, the Boltzmann machines should be well equipped to deal with tasks that involve a large number of weak constraints. However, what we have learned about them so far doesn’t endow them with such qualities. Specifically the hidden units, seen as hidden latent causes, are not good at modelling constraints between variables. Hidden ancestral variables, i.e., the variables corresponding to the hidden units, may be good for modelling some types of correlation, but they cannot be used to decrease variance. A better way to model constraints is to use an energy–based model that associates high energies with data vectors that violate constraints.

Inspired by a variant of Hopfield’s network, which we described in Section 2.5, Hinton showed that there exists an expression for the energy of a configuration of the network such that, under certain circumstances, the individual units act so as to minimize the global energy. In [10] Hinton explained the importance of the energy of a parallel system like the Boltzmann machine: it represents the degree of violation of the constraints between hypotheses and consequently determines the dynamics of the search. He also formulated the following postulates or


assumptions about the energy, which he later used to derive the main properties of the probabilistic system that is the Boltzmann machine.

Postulate 1:

There is a potential energy function over states of the whole system which is a function 푓(퐏(휎)) of the probability of a state 휎.

This is equivalent to saying that, given any input, a particular state or configuration 휎 of a Boltzmann machine has exactly one probability. It does not, for instance, have a probability of 0.3 and also a probability of 0.5.

Postulate 2:

The potential energy function is additive for independent systems. Since the probability for a combination of states of independent systems is multiplicative, it follows that:

푓(퐏(휎)) + 푓(퐏(휎′)) = 푓(퐏(휎)퐏(휎′))

The only function that satisfies this equation is:

푓(퐏(휎)) = 푘 ∙ ln 퐏(휎)

To make more probable states have lower energy the real–valued constant 푘 must be negative.

Postulate 3:

The part of the potential energy contributed by a single unit can be computed from information available to the unit.

Only potential energies symmetrical in all pairs of units have this property, since in this case a unit can "deduce" its effect on other units from their effect on itself.

Under the previous assumptions, the individual units of a Boltzmann machine can be made to act so as to minimize the global energy. If some of the units are clamped into particular states to represent a particular input, the system will then try to find the minimum energy configuration that is compatible with that input. Thus, the energy of a configuration can be interpreted as the


extent to which that combination of hypotheses fails to fit the input and violates the constraints implicit in the problem domain. So, in minimizing energy, the system is maximizing the extent to which a perceptual interpretation fits the data and satisfies the constraints. Consequently the system evolves towards interpretations of that input that increasingly satisfy the constraints of the problem domain [10].

Using the previous notations, the global energy of the system, also referred as the energy of the configuration 휎 of the system, is defined as:

$E(\sigma) = -\Big(\sum_{\{i,j\}\in\mathcal{G},\, i<j}\sigma_i\cdot\sigma_j\cdot\hat{w}_{ij} - \sum_{i\in\mathcal{N}}\sigma_i\cdot\theta_i\Big)$

or:

$E(\sigma) = -\Big(\frac{1}{2}\cdot\sum_{\{i,j\}\in\mathcal{G}}\sigma_i\cdot\sigma_j\cdot\hat{w}_{ij} - \sum_{i\in\mathcal{N}}\sigma_i\cdot\theta_i\Big)$   (4.31)

equivalent to:

$E(\sigma) = -\mathrm{H}_{(\hat{\mathbf{W}},\mathbf{\Theta})}(\sigma) = -\mathrm{H}_{\mathbf{W}}(\sigma)$   (4.32)

If we represent the configuration 휎 as an 푛–dimensional column vector, the weights Ŵ as an 푛 × 푛 symmetric matrix, and the thresholds as an 푛–dimensional column vector, then the energy of configuration 휎 can be written in matrix form as:

$E(\sigma) = -\Big(\frac{1}{2}\cdot\sigma^{\mathrm{T}}\hat{\mathbf{W}}\sigma - \mathbf{\Theta}^{\mathrm{T}}\sigma\Big)$   (4.33)

We observe that the global energy defined by the equations (4.31) belongs to the Hamiltonian family given by the definition (4.10). Moreover, the global energy is the negative of the quadratic Hamiltonian defined by the equations (4.16) and (4.17).
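As a quick illustration of the matrix form (4.33), the following Python snippet evaluates the energy of a binary configuration. It is our own sketch; the variable names are assumptions made only for this example.

import numpy as np

def energy(sigma, W_hat, theta):
    """Energy (4.33) of a configuration sigma in {0, 1}^n.

    W_hat : (n, n) symmetric weight matrix with zero diagonal
    theta : (n,) threshold vector
    """
    sigma = np.asarray(sigma, dtype=float)
    return -(0.5 * sigma @ W_hat @ sigma - theta @ sigma)

# Example: two units joined by a positive weight prefer to be on together.
W_hat = np.array([[0.0, 2.0], [2.0, 0.0]])
theta = np.array([0.5, 0.5])
print(energy([1, 1], W_hat, theta))   # -1.0: the lowest of the four configurations
print(energy([0, 0], W_hat, theta))   #  0.0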

In Section 2.5 we presented the Hopfield update rule: switch each randomly selected unit into whichever of its two states yields the lower total energy given the current configuration of the network. If the Boltzmann machine operated according to the Hopfield update rule, then it would be no different from a multilayer perceptron that follows the same rule; in other words, it would suffer from the standard weakness of gradient descent methods: it could get stuck at a local minimum that is not a global minimum [10-11]. This is an inevitable consequence of only allowing jumps to states of lower energy, the so-called “downhill moves”. Unlike the Boltzmann machine, the Hopfield network does not suffer from this weakness because its local energy minima are used to store memories. Therefore, if the Hopfield network


is started near some local minimum, the desired behavior is to fall into that local minimum and not to find the global minimum.

Hinton realized that, if jumps to higher energy states occasionally occurred, it would be possible for the system to break out of local minima, but it was not clear to him how the system would then behave and also when the uphill steps should be allowed [43]. Therefore, in order to escape the local minima in a Boltzmann machine, Hinton advanced the following idea: make the binary units stochastic and add thermal noise to the global energy such that, occasionally, it would lead to uphill steps. Hence, Hinton proposed that a stochastic unit should update its state based on its previous state according to the following rule: the ith unit of a configuration 휎 at time 푡 takes the state 0 or the state 1 with probability:

$\mathbf{P}(\sigma_i) = \frac{1}{1 + \exp\big(\frac{\Delta E_i}{\mathbf{T}}\big)}$   (4.34)

where: $\mathbf{T}$ is the pseudo–temperature, i.e., a parameter which models the thermal noise injected into the system; and $\Delta E_i$ is the energy gap between the current state and the previous state of the ith unit of a configuration $\sigma$.

Hinton used a simulated annealing algorithm to control the level of thermal noise. He studied experimentally the effect of thermal noise on transition probabilities and came up with an annealing schedule that starts with a higher pseudo–temperature and gradually reduces it to a pseudo–temperature of 1. He based his annealing schedule on the following observations: at low pseudo–temperatures there is a strong bias in favor of states with low energy, but the time required to reach equilibrium may be longer; at higher pseudo–temperatures the bias is not so favorable, but equilibrium is reached faster [43]. According to Hinton, this technique cannot guarantee that a global minimum will be found, but it can guarantee that a nearly global minimum can be found with high probability.

Later, Hinton refined his original update rule by adopting, during each annealing stage, i.e., when the pseudo–temperature is kept constant, a variant of the Metropolis algorithm. He also proposed a simplified version of the update rule (4.34): “if the energy gap between the on and off states of the ith unit of a configuration 휎 is 훥퐸푖′, then, regardless of the previous state of that unit, set the unit to 1 with a probability given by formula” (4.35):

$\mathbf{P}(\sigma_i = 1) = \frac{1}{1 + \exp\big(\frac{E(\sigma_i=1) - E(\sigma_i=0)}{\mathbf{T}}\big)}$

$\mathbf{P}(\sigma_i = 1) = \frac{1}{1 + \exp\big(-\frac{E(\sigma_i=0) - E(\sigma_i=1)}{\mathbf{T}}\big)}$

$\mathbf{P}(\sigma_i = 1) = \frac{1}{1 + \exp\big(-\frac{\Delta E_i'}{\mathbf{T}}\big)}$   (4.35)

where: $\Delta E_i' = E(\sigma_i = 0) - E(\sigma_i = 1)$
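A minimal Python sketch of the stochastic update rule (4.35) for a single unit, under the energy (4.31) and with the pseudo-temperature T as a parameter, is given below. The helper name and data layout are ours, chosen only for this illustration.

import numpy as np

def update_unit(sigma, i, W_hat, theta, T=1.0, rng=np.random):
    """Set unit i to 1 with probability sigm(dE'_i / T), as in rule (4.35)."""
    # dE'_i = E(sigma_i = 0) - E(sigma_i = 1) equals the net input towards unit i
    delta_E = W_hat[i] @ sigma - W_hat[i, i] * sigma[i] - theta[i]
    p_on = 1.0 / (1.0 + np.exp(-delta_E / T))
    sigma[i] = 1 if rng.random() < p_on else 0
    return sigma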

Hinton found inspiration in Boltzmann’s work, specifically the principle that a network consisting of a large number of units, with each unit interacting with neighbouring units, will approach a canonical distribution at equilibrium given by the Boltzmann–Gibbs distribution. Although the development of Boltzmann machines has been motivated by ideas from statistical physics, they are nevertheless neural networks. Therefore, the following two differences, which we mentioned in the context of Markov random fields, should be carefully noted. Firstly, in neural networks the parameter 퐓 plays the role of a pseudo–temperature that has no physical meaning. Secondly, in neural networks the Boltzmann’s constant 퐤 can be taken equal to 1.

The local nature of the update rules (4.34) and (4.35) ensures that raising the noise level is equivalent to decreasing all the energy gaps between configurations, so in thermal equilibrium the relative probability of two configurations 휎 and 휎′ is determined solely by their energy difference and follows a Boltzmann distribution:

$\frac{\mathbf{P}(\sigma)}{\mathbf{P}(\sigma')} = \exp\Big(-\frac{E(\sigma) - E(\sigma')}{\mathbf{T}}\Big)$   (4.36)

where: $\sigma$ and $\sigma'$ are two configurations of a Boltzmann machine; $\mathbf{P}(\sigma)$ is the probability of the Boltzmann machine to be in the configuration $\sigma$; $E(\sigma)$ is the energy of the configuration $\sigma$; and $\mathbf{T}$ is the pseudo–temperature.

Hinton’s justification for this heuristic was the fact that energy barriers are what prevent a system from reaching equilibrium rapidly at low pseudo–temperature and, if the energy barriers can be suppressed or at least surpassed, equilibrium can be achieved rapidly at a pseudo–temperature at which the distribution strongly favors the lower minima [43]. However, in Hinton’s opinion, the energy barriers cannot be permanently removed because they correspond to states that violate the constraints and the energies of these states must be kept high to prevent the system from settling into them. Thereby, a solution to surpass the energy barriers between low–lying states was needed. Hinton realized that, in a system with a high–dimensional state space like the Boltzmann machine, the energy barriers between low–lying states are


highly degenerate, so the number of ways of getting from one low–lying state to another is an exponential function of the height of the barrier one is willing to cross [43]. Thus, the effect of either one of the update rules (4.34) and (4.35) is the opening of an enormous variety of paths for escaping from a local minimum and, even though each path by itself is unlikely, it is highly probable that the system would cross the energy barrier between two low–lying states [43].

4.4 How a Boltzmann Machine models data

As a binary pairwise Markov network, a Boltzmann machine operates with probabilities; specifically, it associates to each input a probability distribution over the output. Hence, we are looking at two categories of data that are essential for this type of network: environmental data and probability distributions associated with the underlying graphical model.

 The input or environmental data consist of a set of binary vectors. Each input vector is mapped one–to–one to the set of visible units, in this way producing a configuration over the visible units. The problem that we need to address regarding these configurations is to fit a model that will assign a probability to every possible configuration over the visible units. The formulae (4.25) show that there are $2^m$ such configurations. Knowing this probability distribution would allow us to decide whether other binary vectors come from the same distribution. The network is said to have a perfect model of the environment if it achieves exactly the same probability distribution over these $2^m$ configurations when it is running freely at thermal equilibrium with no environmental input.

In order to allow the network to approach thermal equilibrium, Hinton makes the assumptions that each of the environmental input vectors persists for long enough and the structure in the sequence of environmental vectors, if any, should be ignored [11]. The distribution over all visible configurations 푣 is nothing else than the marginal distribution over all the configurations 휎 of the network.

 The probability distributions associated with the underlying graphical model can be divided into three categories: joint configuration probabilities, conditional probabilities, and marginals. To define these distributions, we follow an approach similar to the one we used in Section 4.3 for the individual unit.

 Joint configurations probabilities:


Let us consider the configuration 휎 of visible units 푣 and hidden units ℎ given by the equation (4.27). Such a configuration is often called the “joint configuration” of 푣 and ℎ. The probability of the joint configuration 휎 is related to the energy of that configuration, which is given by the equations (4.31). Therefore, we start by making 푣 and ℎ explicit in (4.31):

$E(\sigma) = E(v,h) = -\sum_{\{i,j\}\in\mathcal{G},\, i<j}\sigma_i\cdot\sigma_j\cdot\hat{w}_{ij} + \sum_{i\in\mathcal{N}}\sigma_i\cdot\theta_i$   (4.37)

In Section 4.3 we have learned that, in a Boltzmann machine, the energy of a configuration 휎 can be seen as a real function defined on I𝒩; therefore, according to the definition (4.10), it belongs to the Hamiltonian family. Moreover, the energy of a configuration 휎 is the negative of a quadratic Hamiltonian. Then, the logical steps we need to follow to obtain the expression of the probability of a joint configuration 휎 = (푣, ℎ) are the same steps we followed in Section 4.1 to define the Gibbs measure associated to a Hamiltonian. We start by looking at how the energies of joint configurations are related to their probabilities and we identify two ways in which they are logically connected:

 In one way we can define the probability of a joint configuration 휎 = (푣, ℎ) by using an exponential model similar to one used in the definition (4.13):

퐏(푣, ℎ) ∝ exp(−퐸(푣, ℎ)) (4.38)

 In other way we can define the probability of a joint configuration 휎 = (푣, ℎ) to be the probability of finding the network in that particular joint configuration after we have updated all of the stochastic binary units many times. Thus, the probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations. This approach also follows the definition (4.13).

To comply with both requirements, we need to specify a normalization factor in the first definition that is compatible with the second definition and this is exactly the partition function defined by (4.14):

$Z = \sum_{u\in\mathbf{I}^{\mathcal{V}},\, g\in\mathbf{I}^{\mathcal{H}}}\exp(-E(u,g))$   (4.39)


Therefore, the probability of a joint configuration (푣, ℎ) of visible and hidden units is defined as:

$\mathbf{P}(v,h) = \frac{\exp(-E(v,h))}{Z} = \frac{\exp(-E(v,h))}{\sum_{u\in\mathbf{I}^{\mathcal{V}},\, g\in\mathbf{I}^{\mathcal{H}}}\exp(-E(u,g))}$   (4.40)

 Conditional probabilities:

The conditional distributions over hidden and visible units are given by:

$$\mathbf{P}(h_j = 1 \mid v, h_{-j}) = \mathrm{sigm}\Bigl(\sum_{i\in\mathcal{V}} v_i\,\hat{w}_{ij} + \sum_{m\in\mathcal{H}-\{j\}} h_m\,\hat{w}_{mj} - \theta_j\Bigr) \qquad (4.41)$$

$$\mathbf{P}(v_i = 1 \mid h, v_{-i}) = \mathrm{sigm}\Bigl(\sum_{j\in\mathcal{H}} h_j\,\hat{w}_{ji} + \sum_{k\in\mathcal{V}-\{i\}} v_k\,\hat{w}_{ki} - \theta_i\Bigr) \qquad (4.42)$$

where $x_{-i}$ denotes a vector $x$ with the $i$-th component $x_i$ omitted and sigm is the logistic sigmoid function.

 Marginal probabilities:

The probability of a configuration v of the visible units is the sum of the probabilities of all the joint configurations that contain it, and it is identical to the marginal distribution over the configuration v of visible units. It is computed with the following formula:

$$\mathbf{P}(v) = \frac{\sum_{h\in I^{\mathcal{H}}} \exp\bigl(-E(v,h)\bigr)}{\sum_{u\in I^{\mathcal{V}},\, g\in I^{\mathcal{H}}} \exp\bigl(-E(u,g)\bigr)} \qquad (4.43)$$

The formulae (4.40) to (4.43) show that all the distributions specific to a generic Boltzmann machine are intractable. The main reason for their intractability is the computation of the partition function 푍 given by the equation (4.39).
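To see concretely where this intractability comes from, the following small Python sketch (our own illustration with assumed names, not code from the thesis) enumerates all 2^n configurations of a toy Boltzmann machine and evaluates the partition function (4.39) and the joint probabilities (4.40) exactly. The sum over 2^n terms is precisely what becomes infeasible for networks of realistic size.

```python
import itertools
import numpy as np

def energy(sigma, W, theta):
    """E(sigma) = -sum_{i<j} sigma_i sigma_j w_ij + sum_i sigma_i theta_i (W symmetric, zero diagonal)."""
    s = np.asarray(sigma, dtype=float)
    return -0.5 * s @ W @ s + s @ theta   # 0.5 corrects the double counting of the symmetric sum

def joint_distribution(W, theta):
    """Exact joint distribution over all 2^n configurations; tractable only for very small n."""
    n = len(theta)
    configs = list(itertools.product([0, 1], repeat=n))
    unnormalized = np.array([np.exp(-energy(c, W, theta)) for c in configs])
    Z = unnormalized.sum()                # the partition function: a sum with 2^n terms
    return configs, unnormalized / Z, Z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 4                                 # 2^4 = 16 configurations; at n = 100 the sum is hopeless
    W = rng.normal(scale=0.5, size=(n, n))
    W = np.triu(W, 1); W = W + W.T        # symmetric weights, zero diagonal
    theta = rng.normal(scale=0.5, size=n)
    configs, probs, Z = joint_distribution(W, theta)
    print("Z =", Z)
    print("most probable configuration:", configs[int(np.argmax(probs))])
```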

4.5 General dynamics of Boltzmann Machines

Hinton investigated the behaviour of the Boltzmann machine under either one of the update rules (4.34) and (4.35) in two scenarios: when approaching the thermal equilibrium and after the thermal equilibrium was reached. To help understand the concept of thermal equilibrium, he suggested an intuitive way to think about it which is inspired by the idea behind the equation (2.7):


Imagine a huge ensemble of systems that all have exactly the same energy function; then the probability of a global state of the ensemble is just the fraction of the systems that have the corresponding state.

Similarly, reaching thermal equilibrium in a Boltzmann machine does not mean that the system has settled down into the lowest energy configuration, but that the probability distribution over configurations settles down to the stationary distribution. Hinton came up with an algorithm to describe the dynamics of the Boltzmann machine. We are going to give firstly an intuitive description of the algorithm (Algorithm 4.1) and then we are going to present the algorithm formally (Algorithm 4.2).

Algorithm 4.1: Boltzmann Machine Dynamics 1

Given: W , 퐓

begin

Step 1: Start with any distribution over all the identical units.

We could start with all the units in the same state or with an equal number of units in each possible state.

Step 2: Keep applying the stochastic update rule (4.34) to pick the next state for each randomly selected individual unit.

Step 3: After running the units stochastically in the right way, the network may eventually reach a situation where the fraction of units in each state remains constant.

This is the stationary distribution that physicists call thermal equilibrium.

Step 4: After reaching the thermal equilibrium, any given unit keeps changing its state but the fraction of units in each state does not change. In other words, once equilibrium has been reached, the number of units that "leave" a configuration at each time step will be equal to the number of units that "enter" that configuration.

end


We start our approach to formally define the dynamics of the Boltzmann machine by establishing the assumptions the network works under:

• the pseudo–temperature is T;
• time is discrete and is represented by the set of nonnegative integers 𝒯 = {0, 1, 2, …};
• at each time t ∈ 𝒯 one unit i ∈ 𝒩 is selected at random for a possible update.

We observe that, if a Boltzmann machine has the weights and thresholds varying in time, deterministically or stochastically, then the Boltzmann machine becomes a particular case of a time–varying neural network, whose definition follows.

Definition 4.3:

A time–varying Boltzmann machine 퐓퐕퐁퐌 is a 퐁퐌 that has a fixed 𝒩, a fixed 𝒢, and whose parameters W = (Ŵ , Θ) can vary in time, either deterministically or stochastically.

Precisely, a TVBM is a four–tuple (𝒩, 𝒢, Ŵ, Θ) where Ŵ = {Ŵ(t)}_{t∈𝒯} and Θ = {Θ(t)}_{t∈𝒯} are ℝ^𝒢–valued and ℝ^𝒩–valued stochastic processes, respectively.

With respect to Definition 4.3, we make the following remarks:

• If BM is a TVBM, then the net input computed according to the equation (4.9) by using the weights Ŵ(t) and the thresholds Θ(t) is called the net input at time t and is denoted net_t(i, σ).
• The unit i selected for update at moment t and denoted i(t) finds out what its new state is going to be by computing two quantities: its net input at time t and the probability given by the update rule.
• The update rules (4.34) and (4.35) become time–varying as well, so they need to be changed to reflect the time factor. However, before we modify the general update rule (4.34) to reflect the time factor, we need to rewrite the rule to highlight the states of the i-th unit at two consecutive moments: t − 1 and t.

Claim: Given a configuration 휎 and a unit 푖, the update rule (4.34) can be written as:


$$\mathbf{P}\bigl(\sigma_i(t) \mid \sigma_i(t-1)\bigr) = \frac{1}{1 + \exp\Bigl(\dfrac{-(2\,\sigma_i(t) - 1)\cdot \mathrm{net}_t\bigl(i, \sigma(t-1)\bigr)}{\mathbf{T}}\Bigr)} \qquad (4.44)$$

Proof: We start by looking at the denominator in the formula (4.34), specifically at the term Δ퐸푖. We know from the equation (4.32) that the energy of a configuration is the negative of the quadratic Hamiltonian of that configuration, so we are focusing our attention on the quadratic Hamiltonian given by the equation (4.16).

Given a configuration 휎 and a unit 푖, we can split the quadratic Hamiltonian associated to the configuration 휎 into two terms such that one term reflects the contribution of the unit 푖 and other term H′ incorporates the contribution of all the units except 푖. We observe that the term that reflects the contribution of unit 푖 to the quadratic Hamiltonian is related to the net input to unit 푖 defined by the equation (4.9). Therefore, we can write:

$$H_W(\sigma) = \mathrm{net}(i,\sigma)\cdot\sigma_i + H' \qquad (4.45)$$

In particular, if 휎(푖) denotes the configuration obtained from 휎 by switching the value of the ith unit, then we can compute the variation of the quadratic Hamiltonian corresponding to this operation:

$$\Delta H = H_W(\sigma^{(i)}) - H_W(\sigma) = \mathrm{net}(i,\sigma^{(i)})\cdot\sigma_i^{(i)} + H' - \mathrm{net}(i,\sigma)\cdot\sigma_i - H'$$

$$\Delta H = \mathrm{net}(i,\sigma^{(i)})\cdot\sigma_i^{(i)} - \mathrm{net}(i,\sigma)\cdot\sigma_i \qquad (4.46)$$

A few observations have to be made with respect to the equation (4.46).

 Firstly, the net input of a configuration towards a unit (equation (4.9)) doesn’t depend on the state of that unit. This means that two configurations that differ only in one and the same unit have exactly the same net input to that unit, assuming that the parameters of the network are the same. This fact translates to:

net(푖, 휎(푖)) = net(푖, 휎) (4.47)

Accordingly, the equation (4.46) can be rewritten as:

$$\Delta H = \mathrm{net}(i,\sigma)\cdot(\sigma_i^{(i)} - \sigma_i) = \mathrm{net}(i,\sigma)\cdot\Delta\sigma_i \qquad (4.48)$$

 Secondly, switching the value of the ith hypothesis of a configuration that is also an I–valued configuration, where I = {0,1}, is the same as applying the following formula to that hypothesis:


$$\sigma_i^{(i)} = 1 - \sigma_i \;\Leftrightarrow\; \sigma_i = 1 - \sigma_i^{(i)} \qquad (4.49)$$

Accordingly, the equation (4.48) can be written in an equivalent form using only the new state $\sigma_i^{(i)}$ given by the equation (4.49):

$$\Delta H = \mathrm{net}(i,\sigma)\cdot\bigl(\sigma_i^{(i)} - 1 + \sigma_i^{(i)}\bigr) = \mathrm{net}(i,\sigma)\cdot\bigl(2\,\sigma_i^{(i)} - 1\bigr) \qquad (4.50)$$

 Thirdly, we compute Δ퐸푖 :

$$\Delta E_i = E(\sigma^{(i)}) - E(\sigma) = -H_W(\sigma^{(i)}) + H_W(\sigma) = -\Delta H$$

$$\Delta E_i = -\mathrm{net}(i,\sigma)\cdot\Delta\sigma_i = -\bigl(2\,\sigma_i^{(i)} - 1\bigr)\cdot\mathrm{net}(i,\sigma) \qquad (4.51)$$

If we substitute (4.51) into (4.34) we obtain:

$$\mathbf{P}\bigl(\sigma_i(t)\bigr) = \frac{1}{1 + \exp\Bigl(\dfrac{-(2\,\sigma_i(t) - 1)\cdot\mathrm{net}_t\bigl(i,\sigma(t-1)\bigr)}{\mathbf{T}}\Bigr)} \qquad (4.52)$$

The state of the ith unit at time 푡 depends only on its previous state. Therefore, the equation (4.52) becomes exactly (4.44).

After we incorporate all the remarks regarding Definition 4.3 into the formula (4.44), we obtain the following update rule for a 퐓퐕퐁퐌. This is the rule used by Algorithm 4.2.

$$\mathbf{P}\bigl(\sigma_{i(t)}(t) \mid \sigma_{i(t)}(t-1)\bigr) = \frac{1}{1 + \exp\Bigl(\dfrac{-(2\,\sigma_{i(t)}(t) - 1)\cdot\mathrm{net}_{t-1}\bigl(i(t),\sigma\bigr)}{\mathbf{T}}\Bigr)} \qquad (4.53)$$

Definition 4.4:

A Boltzmann Machine Dynamics (BMD) on a network 퐁퐌/퐓퐕퐁퐌 is a Markov chain {𝝈(풕)}풕∈퓣 with state space I𝒩 and whose transitions occur according to the following algorithm:

Algorithm 4.2: Boltzmann Machine Dynamics 2

Given: W , 퐓

begin

Step 1: repeat


Step 2: at each time t ∈ 𝒯 one unit i(t) is chosen at random from the set 𝒩 ∪ {0} with probability 1/(n + 1)

Step 3: if the unit 푖(푡) is the “0” unit then set: 휎(푡) = 휎(푡 − 1)

else

Step 4: compute the net input to unit 푖(푡):

x = net_{t−1}(i(t), σ)

Step 5: generate the candidate state 푦 for the unit 푖(푡):

푦 = 1 − 휎(푡 − 1)푖(푡)

Step 6: if y = 0 then compute the probability: P = 1 / (1 + exp(x/T))

        else compute the probability: P = 1 / (1 + exp(−x/T))

Step 7: if 푥 ∙ (2푦 − 1) > 0 then accept y as the state of unit 푖(푡):

휎(푡)푖(푡) = 푦

else accept y as the state of unit 푖(푡) with probability 퐏:

if random(0,1) < 퐏 then 휎(푡)푖(푡) = 푦

Step 8: 휎(푡)푗 = 휎(푡 − 1)푗 for any 푗 ≠ 푖(푡)

Step 9: until stopping criterion true

end

In Algorithm 4.2 the notation random(0,1) denotes a sample from a uniform distribution 𝒰[0,1]. Furthermore, we consider Boltzmann machines with inputs and their corresponding dynamics. Specifically, we clamp certain units at certain values of the activation levels and do not allow them to switch. The set of units that are clamped at a particular time t will be allowed to depend on t. Suppose we choose a subset 풞 ⊆ 𝒩, which may depend on t, in which case we denote it by 풞(t) ⊆ 𝒩. Also suppose we have an "external input," i.e., a process v = {v(t)}_{t∈𝒯} such that each v(t) takes values in I^{풞(t)}.


Definition 4.5:

A Boltzmann Machine Dynamics with Inputs (BMDI) is a BMD process that follows Algorithm 4.2 except that in Step 2 the choice of i(t) is limited to the set (𝒩 ∪ {0}) − 풞(t); consequently the probability of a particular unit being chosen is 1/|(𝒩 ∪ {0}) − 풞(t)| = 1/(n + 1 − |풞(t)|).

Therefore, Step 2 of the algorithm looks like this:

Step 2: at each time t ∈ 𝒯 one unit i(t) is chosen at random from the set (𝒩 ∪ {0}) − 풞(t) with probability 1/(n + 1 − |풞(t)|)
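As a rough illustration of how a single transition of this dynamics might be implemented (Algorithm 4.2 together with the clamped set 풞(t) of Definition 4.5), here is a hypothetical Python sketch; the function and variable names are our own assumptions, and a fixed pseudo-temperature is used instead of an annealing schedule.

```python
import numpy as np

def bm_dynamics_step(sigma, W, theta, T, clamped=frozenset(), rng=np.random.default_rng()):
    """One transition in the spirit of Algorithm 4.2.

    sigma   : current 0/1 configuration (numpy array of length n)
    W       : symmetric weight matrix with zero diagonal
    theta   : threshold vector
    T       : pseudo-temperature
    clamped : indices whose state may not change (the set C(t) of Definition 4.5)
    """
    n = len(sigma)
    free = [i for i in range(n) if i not in clamped]
    # drawing index -1 plays the role of the dummy "0" unit that leaves the configuration unchanged
    i = int(rng.choice(len(free) + 1)) - 1
    if i < 0:
        return sigma
    i = free[i]
    x = W[i] @ sigma - theta[i]                         # net input to unit i
    y = 1 - sigma[i]                                    # candidate state: flip unit i
    p = 1.0 / (1.0 + np.exp(-(2 * y - 1) * x / T))      # acceptance probability of Step 6
    if x * (2 * y - 1) > 0 or rng.random() < p:         # Step 7: accept "downhill" flips outright
        sigma = sigma.copy()
        sigma[i] = y
    return sigma

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 6
    W = rng.normal(size=(n, n)); W = np.triu(W, 1); W = W + W.T
    theta = rng.normal(size=n)
    sigma = rng.integers(0, 2, size=n)
    for _ in range(1000):                               # run the chain toward its stationary distribution
        sigma = bm_dynamics_step(sigma, W, theta, T=1.0, clamped={0}, rng=rng)
    print(sigma)
```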

4.6 The biological interpretation of the model

In this section we present Hinton's original argumentation to support the idea that Boltzmann machines bear resemblance to the brain and are therefore worth studying. We start by presenting some of the facts that make the Boltzmann machine belong to the same general class of computation devices as the brain. Then we present some irreconcilable differences between the Boltzmann machine and the cortex. Both categories of arguments (pro and con) have been presented by Hinton in [10]. Before we present the arguments, we need to define two concepts, native to physiology, which we are going to use.

An action potential is a short–lasting event in which the electrical membrane potential of a cell rapidly rises and falls, following a consistent trajectory; in other words, it is a propagated impulse.

An electrotonic potential is a non–propagated local potential, resulting from a local change in ionic conductance (e.g., synaptic or sensory activity that engenders a local current); when it spreads along a stretch of membrane, it becomes exponentially smaller (decrement).

Similarities between the Boltzmann machine and the cortex:

• The cerebral cortex is relatively uniform in structure.
• Different areas of cerebral cortex are specialized for processing information from different sensory modalities, such as the visual cortex, auditory cortex, and somatosensory cortex. Other areas are specialized for motor functions. However, all these cortical areas have a similar anatomical organization and are more similar to each other in cytoarchitecture than they are to any other part of the brain [10].
• Many problems in vision, speech recognition, associative recall, and motor control can be formulated as searches. The similarity between different areas of cerebral cortex suggests that the same kind of massively parallel searches may be performed in many different cortical areas [10].

 Differences between Boltzmann machine and the cortex:

 Binary states and action potentials

The simple binary units which are components of Boltzmann machines are not literal models of cortical neurons. According with Hinton, the assumptions that the binary units change their states asynchronously and they use a probabilistic decision rule seem closer to the reality than a model with synchronously deterministic updating [10]. The energy gap for a binary unit has a role similar to that played by the membrane potential for a neuron: both are the sum of excitatory and inhibitory inputs and both are used to determine the output state. However, the cortical neurons produce action potentials, which are brief spikes that propagate down axons, rather than binary outputs. When an action potential reaches a synapse, the signal it produces in the postsynaptic neuron rises to a maximum and then exponentially decays with the time constant of the membrane (typically around five milliseconds for neurons in cerebral cortex). The effect of a single spike on the postsynaptic cell body may be further broadened by electrotonic transmission down the dendrite to the cell body [10]. The energy gap represents the summed input from all the recently active binary units. If the average time between updates is identified with the average duration of a postsynaptic potential, then the binary pulse between updates can be considered an approximation to the postsynaptic potential. Although the shape of a single binary pulse differs significantly from a postsynaptic potential, the sum of a large number of stochastic pulses is independent of the shape of the individual pulses and depends only on their amplitudes and durations [10]. According with Hinton, for large networks having the large fan–ins typical of cerebral cortex (around 10000), the binary approximation may not be too bad [10].

 Implementing pseudo–temperature in units


The membrane potential of a neuron is graded, but if it exceeds a fairly sharp threshold, an action potential is produced, followed by a refractory period lasting several milliseconds, during which another action potential cannot be elicited. If Gaussian noise is added to the membrane potential, then even if the total synaptic input is below threshold, there is a finite probability that the membrane potential will reach threshold [10]. The amplitude of the Gaussian noise determines the width of the sigmoidal probability distribution for the unit to fire during a short time interval and it therefore plays the role of pseudo–temperature in the model [10]. According with Hinton, a cumulative Gaussian is a very good approximation to the required probability distribution but it might be difficult to implement because the units in the network should be arranged in such a way that all of them have the same amplitude of noise [10].

 Asymmetry and time–delays

In a generic Boltzmann machine all the connections are symmetrical. This assumption is not always true for neurons in the cerebral cortex. However, if the constraints of a problem are inherently symmetrical and if the network, on average, approximates the required symmetrical connectivity, then random asymmetries in a large network will be reflected as an increase in the Gaussian noise in each unit [10]. Hinton proposes the following experiment to see why random asymmetry acts as Gaussian noise:

Consider a symmetrical network in which pairs of units are linked by two equal one–way connections, one in each direction. Then perform the following operation on all pairs of these one–way connections: remove one of the connections and double the strength of the other. Provided the choice of which connection to remove is made randomly, this operation will not alter the expected value of the input to a unit from the other units. On average, it will “see” half as many other units but with twice the weight. So if a unit has a large fan–in, it will be able to make a good unbiased estimate of what its total input would have been if the links had not been cut. However, the use of fewer, larger weights will increase the variance of the energy gap and will thus act as added noise.

Experimentally Hinton came to the conclusion that time–delays act like added noise as well. His experimental results have been confirmed mathematically for first order constraints, provided the fan–in is large and the weights are small compared with the energy gaps [10].


Chapter 5. The Mathematical Theory of Learning Algorithms for Boltzmann Machines

One of the most interesting aspects of the Boltzmann machine formalization is that it leads to a domain–independent learning algorithm [10]. Intuitively, learning for Boltzmann machines means "acquiring a particular behavior by observing it" [3], i.e., progressively adjusting the connection strengths between units in such a way that the whole network develops an internal model which captures the underlying structure of the environment [10]. The goal of learning in Boltzmann machines is rather different from that of other learning algorithms such as, for instance, supervised learning. Rather than learning a non–linear model from inputs to outputs, the goal of learning in the classical asynchronous Boltzmann machine is to improve the network's model of the structure of the environment by choosing the parameters of the network such that the stochastic behaviour observed on the visible units when the network is free-running closely models that observed in the environment [43].

5.1 Problem description

The formal definition of the learning process we present is inspired by Sussmann's work [1,3] but at the same time reflects our understanding of this family of algorithms. Before we formalize the learning process, we lay out the context it operates in. Suppose we are given a Boltzmann machine BM = (𝒩, 𝒢, Ŵ, Θ) with |𝒩| = n and with the set of random variables associated with the units denoted X = (X_1, X_2, …, X_n). In this way we establish the connection between Boltzmann machine learning and the Markov networks discussed in Chapter 2 and Chapter 3. Thus, a configuration σ of BM is nothing else than an instantiation of the set of random variables X of the underlying Markov network.

According to the definition (4.23), the true probability distribution P of a joint configuration in a Boltzmann machine is, in fact, the Gibbs measure G_W (equation (4.13)) associated with the Hamiltonian H_W (equation (4.18)), which itself would be associated with the parameters W of the network at thermal equilibrium, if they were possibly known. Because the partition function of a Boltzmann machine is generally intractable, all these measures – W, H_W, G_W, and P – cannot be determined exactly. Therefore, we resort to their approximations, which in principle are W̄, H_W̄, G_W̄, and P̄. In order to make some proofs easier to grasp, we might use more suggestive notations for some of these variables. If that is the case, we will specify, if applicable, the correspondence between the notations.

Definition 5.1:

Given a Boltzmann machine BM = (𝒩, 𝒢′, W) and a sequence of random configurations σ ∈ I^𝒩, distributed according to a probability P̄, which are presented to the network as inputs at various times, a learning process 𝓛 is a sequence of pairs (W̄, σ) ∈ ℝ^{𝒢′} × I^𝒩 that satisfies the following property: the parameters W̄ converge to a value W such that the corresponding Gibbs measure G_W is the same as the observable distribution P̄ of the configurations σ presented to the network.

$$\lim_{t\to\infty} \overline{W} = W \quad\text{such that}\quad \mathbf{G}_{\mathbf{W}} = \overline{\mathbf{P}} \qquad (5.1)$$

A variant of the learning process has 𝒩 split into two disjoint sets 풱 (visible units) and ℋ (hidden units). Thus, the observable distribution is a probability distribution over I^풱 and the learning process evolves in ℝ^{𝒢′} × I^풱. Definition 5.2 characterizes this scenario.

Definition 5.2:

Given a Boltzmann machine BM = (𝒩, 𝒢′, W) and a sequence of random data vectors v ∈ I^풱, distributed according to a probability P̄, that are presented to the network as inputs at various times, a learning process 𝓛 is a sequence of pairs (W̄, v) ∈ ℝ^{𝒢′} × I^풱 that satisfies the following property: the parameters W̄ converge to a value W such that the marginal of the corresponding Gibbs measure G_W over the variables v ∈ I^풱 is the same as the observable distribution P̄ of the visible vectors v presented to the network.

$$\lim_{t\to\infty} \overline{W} = W \quad\text{such that}\quad \mathrm{MARG}(\mathbf{G}_{\mathbf{W}}, \mathcal{V}) = \overline{\mathbf{P}} \qquad (5.2)$$

In reality, MARG(퐆퐖, 풱) and 퐏̅ are not equal, so Definition 5.2 rather expresses a desired goal of the learning process. Therefore, the aim of asynchronous Boltzmann machine learning becomes to reduce the difference between MARG(퐆퐖, 풱) and 퐏̅ by performing gradient descent in the parameter space on a suitable measure of their difference.

The environment imposes the distribution 퐏̅ over the network by clamping the visible units, which means the following:


• each member of I^풱 is probabilistically selected using P̄; the probability of selecting v is P̄(v);
• the selected members of I^풱 are presented to the network sequentially;
• each selected vector v is tested by running the Boltzmann machine for a time unit long enough for the network to reach thermal equilibrium;
• in each time unit the following two steps take place:
  - Step 1: all the units are updated;
  - Step 2: the visible units are reset to v.

We introduce the following definitions and notations for the probability distributions that play a role in asynchronous Boltzmann machine learning. Then, we summarize them in Table 1.

 Let 퐏퐓(σ) = 퐏퐓(푣, ℎ) be the free running equilibrium distribution at pseudo–temperature 퐓.

 Let 퐏퐓(ℎ|푣) be the probability of the free running network, at thermal equilibrium, that the hidden units are set to ℎ given that the visible units are set to 푣 on the very same time step.

• Let p_T(v) be the probability distribution over the states of the visible units when the network in thermal equilibrium is running freely at pseudo–temperature T.
• Let q(v) be the environmentally imposed probability distribution over the state vectors v of visible units.

 Let 퐐퐓(ℎ|푣) be the probability that vector ℎ will occur on the hidden units when 푣 is clamped on the visible units and the network of hidden units is allowed to run at pseudo–temperature 퐓.

 Let 퐐퐓(휎) = 퐐퐓(푣, ℎ) be the probability of observing the global state 휎 over multiple runs in which successive vectors 푣 are clamped with probability 퐪(푣).

 퐏퐓 represents the probability described as 퐆퐖 in Definition 5.2.

• p_T represents the probability described as MARG(G_W, 풱) in Definition 5.2.
• q represents the probability described as P̄ in Definition 5.2.

Table 1  Distributions of interest in asynchronous Boltzmann machine learning

    Distribution                                                          Visible units    Notation
    Equilibrium (true) distribution P_T ≡ G_W                             Clamped          p_T(v)
    (defined on the whole state space 𝒩 = 풱 ∪ ℋ)                          Free–running     P_T(σ)
    Environmental (data) distribution Q_T                                 Clamped          q(v)
    (defined on the state space 𝒩, observable on the visible space 풱)     Free–running     Q_T(σ)


There is a subtle difference between the conditional distribution 퐏퐓(ℎ|푣), which refers to the free process, and 퐐퐓(ℎ|푣), which refers to the clamped process. During the free process, the visible units are allowed to change on every time step; therefore, 퐏퐓(ℎ|푣) quantifies the probability that the network arrives at configuration (푣, ℎ) on the very same time step. During the clamped process, the visible units are initially set to 푣 and only the network of hidden units is allowed to freely run; therefore, 퐐퐓(ℎ|푣) quantifies the probability that the network of hidden units arrives at configuration ℎ in a time step following the initial time step when the visible units have been clamped.

A formal description of the learning problem in Boltzmann machine is presented below.

Problem Boltzmann Machine Learning:

Given: 𝒩, 𝒢′ and the split of the units into visible and hidden: 𝒩 = 풱 ∪ ℋ

a set of data vectors 푣 ∈ I풱 used as inputs at various times

v are distributed according to an observable q

퐪/퐐퐓 belong to an exponential family and have parameters W̅

퐪 = 퐩퐓

Find: the best possible W̅ close to W

Subject to:

1. a 퐁퐌 with visible units 푣 and hidden units ℎ runs on all its 푛 units

according to a probability 퐏퐓 ≡ 퐆퐖 that has parameters W

2. W̅ converge to W

The learning algorithms for Boltzmann machines build on approximate inference algorithms in pairwise Markov networks. Based on the approach employed to perform approximate inference, the learning algorithms for Boltzmann machines can be divided into two groups or families:

• one family uses approximate maximum likelihood methods;
• the other family uses variational methods to compute the free energies.


5.2 Phases of a learning algorithm in a Boltzmann Machine

By performing learning, the Boltzmann machine captures the underlying structure of its environment and becomes capable of performing various pattern completion tasks. One type of such tasks is to be able to complete a pattern from any sufficiently large part of it without knowing in advance which part must be completed. Another type of task is to know in advance which parts of the pattern will be given as input and which parts will have to be completed as output. Therefore, there are two pattern completion paradigms, which lead to the presence of two phases in the learning procedure such that each phase corresponds to a paradigm.

Before we study these phases, we introduce the following parameters:

• δ ∈ ℝ, δ > 0 is a constant of proportionality called the learning rate;
• pa ∈ ℕ, pa > 0 is the number of patterns shown to the network;
• ep ∈ ℕ, ep > 0 is the number of learning cycles (epochs) during which the algorithm sees all the patterns. An epoch, which is a complete pass through a given dataset, should not be confused with an iteration, which is simply one update of the neural net model's parameters.

A suggestive designation of these phases belongs to Sussmann, who called them "hallucinating phase" respectively "learning phase" [1,3]. Generally, during a learning phase, a pattern 푣 ∈ I풱 is "taught" by clamping the units 푖 ∈ 풱 so that their activation levels 휎푖 are the same as 푣푖 and allowing the hidden units to evolve according to the Metropolis dynamics. In this phase the weights are adjusted according to the Hebb rule, that is, at each step each weight 푤̂푖푗 is incremented by the positive quantity:

$$\Delta\hat{w}_{ij} = \delta\cdot(2\sigma_i - 1)\cdot(2\sigma_j - 1) \qquad (5.3)$$
where δ is the learning rate.

During the hallucinating phase the whole network evolves following the Metropolis dynamics. The adjustment that takes place in this phase is similar to the one from the learning phase, except that now the quantity Δ푤̂푖푗 added to each weight 푤̂푖푗 is negative.
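A tiny sketch of how the increment (5.3) behaves in the two phases (hypothetical Python; the names `delta` and `learning_phase` are assumptions of the illustration, not notation from the thesis):

```python
def phase_increment(sigma_i, sigma_j, delta, learning_phase=True):
    """Hebbian weight increment of equation (5.3).

    During the learning (clamped) phase the product is added;
    during the hallucinating (free-running) phase the same quantity is subtracted.
    """
    dw = delta * (2 * sigma_i - 1) * (2 * sigma_j - 1)
    return dw if learning_phase else -dw

# example: units in sync (both on) reinforce the connection while learning
print(phase_increment(1, 1, delta=0.05, learning_phase=True))    #  0.05
print(phase_increment(1, 0, delta=0.05, learning_phase=True))    # -0.05
print(phase_increment(1, 1, delta=0.05, learning_phase=False))   # -0.05
```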

Now that we know that the learning algorithm has two phases, how are these phases linked temporally? The answer to this question is: the learning and hallucinating phases should alternate.

Hinton justifies the alternation of phases by using a well–known method for identifying the parameters of an unknown probability distribution: maximum likelihood estimation [10]. Hinton


calls these two phases "collecting data–independent statistics" and "collecting data–dependent statistics", or the "negative phase" and the "positive phase", respectively.

According to Hinton, we can formulate the learning problem as one of minimizing the distance between two Gibbs measures: the environmental measure q and the measure p_T = MARG(G_W, 풱), where G_W describes the behavior of the network at equilibrium. Then the gradient of this distance, regarded as a function of the parameters, is a difference of two terms: one term consists of the mean of the product (2σ_i − 1)·(2σ_j − 1) with respect to q; the other term is the mean of the same quantity with respect to G_W. Hinton's claim is that the positive phase computes approximately the first term and the negative phase computes approximately the second term.

Sussmann gives another justification for alternating the phases of the learning procedure [1,3]. He claims that it is not possible for the whole learning procedure to have only the learning phase because, if that happens, the weights would blow up. Sussman’s explanation is that, during the learning phase, the network is doing the "correct" thing (i.e., the configurations 휎 = (푣, ℎ) where 푣 ∈ I풱 have "correct" values), because it has been forced to by clamping the visible units 푖 ∈ 풱 at values that correspond to a desired pattern. Hence, whatever the network is doing, it should be reinforced. If a particular product (2휎푖 − 1) ∙ (2휎푗 − 1) happens to be positive, it means that the net "wants" 휎푖 and 휎푗 to be “in sync”; hence, the weight 푤̂푖푗 should be increased to make this more likely. This means that the connection between the units 푖 and 푗 should be made more

“excitatory" by making 푤̂푖푗 more positive, e.g. by adding to it the positive number Δ푤̂푖푗. Similarly, if the product (2휎푖 − 1) ∙ (2휎푗 − 1) is negative, 푤̂푖푗 should be decreased, and once again this will be achieved by adding the negative number Δ푤̂푖푗 to it.

Furthermore, if the learning algorithm had only the learning phase, then some weights would keep increasing. Indeed, assume that the weights are updated at every step of the learning process and we just look at a pair of visible units 푖 and 푗. If we only performed the learning phase as outlined above, then after 푒푝 × 푝푎 steps the weight 푤̂푖푗 would become:

$$\hat{w}_{ij} + \delta\cdot ep\cdot pa\cdot\bigl\langle(2\sigma_i - 1)\cdot(2\sigma_j - 1)\bigr\rangle \qquad (5.4)$$
where ⟨(2σ_i − 1)·(2σ_j − 1)⟩ represents the sample mean of the product (2σ_i − 1)·(2σ_j − 1) for the sample consisting of the pa patterns v^(1), v^(2), …, v^(pa) used in the training.

If we assume that the patterns v^(1), v^(2), …, v^(pa) are independent and identically distributed, or more generally, that the Markov process {v^(k)}_k is ergodic, then the sample mean for the pair of visible units i and j will converge almost surely to the expected value of (2σ_i − 1)·(2σ_j − 1) with respect to the measure q. Unless this expected value happens to vanish, the weight ŵ_ij will blow up as t → +∞. Therefore, Sussmann concludes that something else has to be done to prevent this from happening, and this could very well be the alternating presence of the hallucinating phase.

5.3 Learning algorithms based on approximate maximum likelihood

One way to find the parameters of the Boltzmann Machine Learning problem is by means of maximum likelihood estimation. Under this principle there is only a single data set (namely the one that is actually observed), and the parameters are treated as quantities to be estimated from it. Maximum likelihood estimation sets the parameters to the value that maximizes P(data | params), i.e., it chooses the parameters such that the probability of the observed data set is maximized. A variant of this principle, very well suited for exponential models, maximizes the log likelihood log P(data | params).

According to Definition 5.2, the goal of asynchronous Boltzmann machine learning is to minimize the difference between MARG(G_W, 풱) and P̄, which translates into minimizing the difference between p_T and q. That is equivalent to maximizing the log likelihood of generating the environmental distribution Q_T when the network is running freely at equilibrium [43].

Regardless of the path chosen – maximizing the log likelihood of 퐐퐓 or minimizing the difference between 퐩퐓 and 퐪 – the end result is the same: W̅ . The path we follow in this paper to obtain W̅ is by minimizing the difference between 퐪 and 퐩퐓, where the difference is expressed by their

KL–divergence KL(q‖p_T). Essentially, the KL–divergence of two probability distributions is always nonnegative and is zero if and only if those distributions are equal (equations (3.34) and (3.35)).
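For concreteness, a minimal sketch of this divergence (our own illustration, assuming the two distributions are given as arrays indexed by the visible configurations, which is only feasible for small m):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) = sum_v q(v) * ln(q(v) / p(v)), summing only over configurations with q(v) > 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# identical distributions give 0; any mismatch gives a strictly positive value
q = np.array([0.5, 0.25, 0.25, 0.0])
print(kl_divergence(q, q))                           # 0.0
print(kl_divergence(q, [0.25, 0.25, 0.25, 0.25]))    # > 0
```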


5.3.1 Learning by minimizing the KL–divergence of Gibbs measures

The aim of this section is to present the generic Boltzmann machine learning algorithm proposed by Ackley, Hinton, and Sejnowski in [11] (see also [43]). In essence, the generic learning algorithm proposed by Ackley et al. in [11,43] computes locally the difference between two statistics and uses the result to update the "local" parameters. One statistic is the expectation with respect to the data distribution, i.e., the environmentally imposed distribution Q_T, and the other statistic is the expectation with respect to the true distribution, i.e., the Gibbs measure P_T. We will introduce the formulae for these expectations later in this section.

In order to derive a measure of how effectively the weights in the network are being used for modelling the environment, Ackley et al. have made the assumption that there is no structure in the sequential order of the environmentally clamped vectors. However, Ackley et al. admitted that this is not a realistic assumption, and a more realistic assumption would be that the complete structure of the ensemble of the environmentally clamped vectors can be specified by giving the probability of each of the $2^m$ vectors over the m visible units [11,43].

The version of the generic Boltzmann machine learning algorithm we present is inspired by [42,61] and reflects our understanding of this important algorithm. We start by evaluating the effect that clamping a data vector onto the visible units has over a hidden unit. In order to accomplish this, we need to establish a new relationship between the vectors 푣 and ℎ.

Claim: Consider a configuration σ = (v, h) where σ ∈ I^𝒩, v ∈ I^풱, and h ∈ I^ℋ. Then v and h are orthogonal in I^𝒩.

Proof: In order to compute the Euclidean inner product between 푣 and ℎ in I𝒩 we need to represent both 푣 and ℎ as configurations in I𝒩. We do this by “packing” with zeros a data (visible) vector 푣 ∈ I풱, up to the dimension 푛 of a configuration 휌 ∈ I𝒩, such that:

∀푖 ∈ 풱, 휌푖 = 푣푖 and ∀푗 ∈ 𝒩 − 풱 = ℋ, 휌푗 = 0 (5.5)

We apply the same “packing” operation, up to the dimension 푛 of a configuration 휏, to any hidden vector ℎ ∈ Iℋ such that:

∀푗 ∈ ℋ, 휏푗 = ℎ푗 and ∀푖 ∈ 𝒩 − ℋ = 풱, 휏푖 = 0 (5.6)


Then the inner product 휌 ∙ 휏T = 휏 ∙ 휌T is zero because the zero components of both configurations 휌 and 휏 coming from “packing” are placed at mutually exclusive indices. Therefore, 휌 and 휏 are orthogonal, which leads to 푣 and ℎ being orthogonal in I𝒩.

Moreover, a configuration σ over v and h can be represented either using the concatenation of v and h or using the sum of the "expanded" versions ρ of v and τ of h:

휎 = (푣, ℎ) = 휌 + 휏 ≝ 푣 + ℎ (5.7)

A consequence of the equation (5.6) is the fact that the equation (4.28) can be rewritten as:

$$\mathbf{p}_{\mathbf{T}}(v) = \sum_{h\in I^{\mathcal{H}}} \mathbf{P}_{\mathbf{T}}(v,h) = \sum_{h\in I^{\mathcal{H}}} \mathbf{P}_{\mathbf{T}}(v+h) \quad\text{for } v\in I^{\mathcal{V}} \qquad (5.8)$$

We are now going to evaluate the activation of a hidden unit 푖 ∈ ℋ due to clamping of the visible units 푣. We do this by distinguishing between the contribution of the hidden units and the contribution of the visible units to the net input of that unit:

$$\mathrm{net}(i,\sigma) = \sum_{\substack{j\in\mathcal{H}\\ j\neq i}} h_j\,\hat{w}_{ji} + \Bigl(\sum_{\substack{j\in\mathcal{V}\\ j\neq i}} v_j\,\hat{w}_{ij} - \theta_i\Bigr) \qquad (5.9)$$

The terms included in the bracket in (5.9) do not depend on h. Moreover, when the visible units v are clamped, the content of the bracket, denoted 𝛉_i and called the effective threshold of unit i, is a constant that acts as a threshold for the unit i of subnet ℋ.

$$\boldsymbol{\theta}_{i} = \theta_i - \sum_{\substack{j\in\mathcal{V}\\ j\neq i}} v_j\,\hat{w}_{ij} \qquad (5.10)$$
Then:
$$\mathrm{net}(i,\sigma) = \sum_{\substack{j\in\mathcal{H}\\ j\neq i}} h_j\,\hat{w}_{ji} - \boldsymbol{\theta}_{i} \qquad (5.11)$$
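A small Python sketch of the effective thresholds (5.10) (our own illustration; the index handling and function names are assumptions) could look as follows:

```python
import numpy as np

def effective_thresholds(W, theta, visible_idx, hidden_idx, v):
    """Equation (5.10): theta_i_eff = theta_i - sum_{j in V} v_j * w_ij for each hidden unit i.

    With v clamped, the hidden subnet behaves as a Boltzmann machine of its own,
    with weights W[hidden, hidden] and these effective thresholds.
    """
    W = np.asarray(W, dtype=float)
    theta = np.asarray(theta, dtype=float)
    v = np.asarray(v, dtype=float)
    return theta[hidden_idx] - W[np.ix_(hidden_idx, visible_idx)] @ v

# toy network: units 0-1 visible, units 2-3 hidden
W = np.array([[ 0.0, 0.2, 0.5, -0.3],
              [ 0.2, 0.0, 0.1,  0.4],
              [ 0.5, 0.1, 0.0,  0.7],
              [-0.3, 0.4, 0.7,  0.0]])
theta = np.array([0.1, -0.2, 0.3, 0.0])
print(effective_thresholds(W, theta, [0, 1], [2, 3], v=[1, 0]))   # [-0.2, 0.3]
```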

The subnet ℋ behaves like a Boltzmann machine with its own interconnecting weights Ŵ and thresholds (훉퐢)푖∈ℋ. This means that, in principle, we know the probability of any particular state or configuration ℎ of ℋ simply because it will be determined by a Boltzmann–Gibbs distribution. To use this fact we need to know the relationship between the internal energy of subnet ℋ operating with effective thresholds (훉퐢)푖∈ℋ and the energy of the whole network 𝒩 of 퐁퐌. The next theorem makes this relationship explicit by means of an algebraic identity.


Theorem 5.1 (Jones [63]):

The energy of the whole network of BM can be computed as the sum of the internal energy of the subnet ℋ in state h when vector v is clamped and the internal energy of the subnet 풱 in state v when it is completely disconnected from the units of ℋ. Formally, we write:

Given:
$$E_{\mathcal{H}}(h\mid v) = -\frac{1}{2}\sum_{j\in\mathcal{H}} h_j \sum_{\substack{i\in\mathcal{H}\\ i\neq j}} h_i\,\hat{w}_{ij} + \sum_{j\in\mathcal{H}} \boldsymbol{\theta}_{j}\, h_j \qquad (5.12)$$
and
$$E_{\mathcal{V}}(v) = -\frac{1}{2}\sum_{j\in\mathcal{V}} v_j \sum_{\substack{i\in\mathcal{V}\\ i\neq j}} v_i\,\hat{w}_{ij} + \sum_{j\in\mathcal{V}} \theta_j\, v_j \qquad (5.13)$$
Then:
$$E(\sigma) = E_{\mathcal{H}}(h\mid v) + E_{\mathcal{V}}(v) \quad\text{where } \sigma = v + h \qquad (5.14)$$

Proof: We start from the energy of a joint configuration given by the equation (4.37):

$$E(v,h) = -\sum_{i\in\mathcal{N}} \sigma_i \sum_{\substack{j\in\mathcal{N}\\ j>i}} \sigma_j\,\hat{w}_{ij} + \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i$$

$$E(v,h) = -\frac{1}{2}\sum_{i\in\mathcal{N}} \sigma_i \sum_{\substack{j\in\mathcal{N}\\ j\neq i}} \sigma_j\,\hat{w}_{ij} + \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i$$

$$E(v,h) = -\frac{1}{2}\sum_{i\in\mathcal{N}} \sigma_i \sum_{\substack{j\in\mathcal{N}\\ j\neq i}} (v_j + h_j)\,\hat{w}_{ij} + \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i$$

$$E(v,h) = -\frac{1}{2}\sum_{i\in\mathcal{N}} \sigma_i \sum_{\substack{j\in\mathcal{V}\\ j\neq i}} v_j\,\hat{w}_{ij} - \frac{1}{2}\sum_{i\in\mathcal{N}} \sigma_i \sum_{\substack{j\in\mathcal{H}\\ j\neq i}} h_j\,\hat{w}_{ij} + \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i$$

$$E(v,h) = -\frac{1}{2}\sum_{j\in\mathcal{V}} v_j \sum_{\substack{i\in\mathcal{N}\\ i\neq j}} \sigma_i\,\hat{w}_{ij} - \frac{1}{2}\sum_{j\in\mathcal{H}} h_j \sum_{\substack{i\in\mathcal{N}\\ i\neq j}} \sigma_i\,\hat{w}_{ij} + \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i$$

$$E(v,h) = -\frac{1}{2}\sum_{j\in\mathcal{V}} v_j \sum_{\substack{i\in\mathcal{N}\\ i\neq j}} \hat{w}_{ij}(v_i + h_i) - \frac{1}{2}\sum_{j\in\mathcal{H}} h_j \sum_{\substack{i\in\mathcal{N}\\ i\neq j}} \hat{w}_{ij}(v_i + h_i) + \sum_{i\in\mathcal{N}} \theta_i (v_i + h_i)$$

$$E(v,h) = \Bigl(-\frac{1}{2}\sum_{j\in\mathcal{V}} v_j \sum_{\substack{i\in\mathcal{V}\\ i\neq j}} v_i\,\hat{w}_{ji} - \frac{1}{2}\sum_{j\in\mathcal{V}} v_j \sum_{i\in\mathcal{H}} h_i\,\hat{w}_{ji}\Bigr) + \Bigl(-\frac{1}{2}\sum_{j\in\mathcal{H}} h_j \sum_{i\in\mathcal{V}} v_i\,\hat{w}_{ji} - \frac{1}{2}\sum_{j\in\mathcal{H}} h_j \sum_{\substack{i\in\mathcal{H}\\ i\neq j}} h_i\,\hat{w}_{ji}\Bigr) + \Bigl(\sum_{j\in\mathcal{V}} v_j\,\theta_j + \sum_{j\in\mathcal{H}} h_j\,\theta_j\Bigr)$$

We observe that, due to the weight symmetry and the commutativity and distributivity of multiplication over addition, the second term and the third term of the last formula are identical. Therefore:

$$E(v,h) = -\frac{1}{2}\sum_{j\in\mathcal{V}} v_j \sum_{\substack{i\in\mathcal{V}\\ i\neq j}} v_i\,\hat{w}_{ji} - \frac{1}{2}\sum_{j\in\mathcal{H}} h_j \sum_{\substack{i\in\mathcal{H}\\ i\neq j}} h_i\,\hat{w}_{ji} - \sum_{j\in\mathcal{H}} h_j \sum_{i\in\mathcal{V}} v_i\,\hat{w}_{ji} + \sum_{j\in\mathcal{V}} v_j\,\theta_j + \sum_{j\in\mathcal{H}} h_j\,\theta_j$$

$$E(v,h) = \Bigl(-\frac{1}{2}\sum_{j\in\mathcal{V}} v_j \sum_{\substack{i\in\mathcal{V}\\ i\neq j}} v_i\,\hat{w}_{ji} + \sum_{j\in\mathcal{V}} v_j\,\theta_j\Bigr) - \frac{1}{2}\sum_{j\in\mathcal{H}} h_j \sum_{\substack{i\in\mathcal{H}\\ i\neq j}} h_i\,\hat{w}_{ji} + \sum_{j\in\mathcal{H}} h_j \Bigl(\theta_j - \sum_{i\in\mathcal{V}} v_i\,\hat{w}_{ji}\Bigr)$$

We observe that the content of the first bracket is exactly 퐸풱(푣) given by the equation (5.13) and the content of the second bracket is exactly 훉퐣 given by the equation (5.10). Therefore:

$$E(v,h) = E_{\mathcal{V}}(v) + \Bigl(-\frac{1}{2}\sum_{j\in\mathcal{H}} h_j \sum_{\substack{i\in\mathcal{H}\\ i\neq j}} h_i\,\hat{w}_{ji} + \sum_{j\in\mathcal{H}} h_j\,\boldsymbol{\theta}_{j}\Bigr)$$

We observe that the content of the bracket is exactly 퐸ℋ(ℎ|푣) given by the equation (5.12). Thus, we obtain exactly the equation (5.14):

퐸(휎) = 퐸(푣, ℎ) = 퐸풱(푣) + 퐸ℋ(ℎ|푣)
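As a quick numerical sanity check of the decomposition (5.14) (a hypothetical sketch using our own helper names, not code from the source), one can evaluate both sides on random parameters and configurations:

```python
import numpy as np

rng = np.random.default_rng(2)
nv, nh = 3, 2
n = nv + nh
V, H = np.arange(nv), np.arange(nv, n)

W = rng.normal(size=(n, n)); W = np.triu(W, 1); W = W + W.T      # symmetric, zero diagonal
theta = rng.normal(size=n)

def energy(sigma):
    """E(sigma) = -sum_{i<j} sigma_i sigma_j w_ij + sum_i sigma_i theta_i."""
    return -0.5 * sigma @ W @ sigma + sigma @ theta

v = rng.integers(0, 2, size=nv).astype(float)
h = rng.integers(0, 2, size=nh).astype(float)
sigma = np.concatenate([v, h])                                   # sigma = v + h after zero-padding

E_V = -0.5 * v @ W[np.ix_(V, V)] @ v + theta[V] @ v              # equation (5.13)
theta_eff = theta[H] - W[np.ix_(H, V)] @ v                       # effective thresholds (5.10)
E_H = -0.5 * h @ W[np.ix_(H, H)] @ h + theta_eff @ h             # equation (5.12)

print(np.isclose(energy(sigma), E_V + E_H))                      # True: equation (5.14)
```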

With respect to Theorem 5.1 we remark that E_𝒱(v) is constant when v is clamped on 𝒱. This makes the calculation of the probability of any particular vector h on the hidden units particularly straightforward. Therefore, we take a closer look at Q_T(h|v), i.e., the probability that vector h will occur on the hidden units when v is clamped on the visible units and ℋ is allowed to run at pseudo–temperature T. Intuitively, the only effect of v on h is to cause the hidden units h to run with effective thresholds 𝛉_j given by the equation (5.10) instead of their regular thresholds θ_j.

Under these circumstances, 퐐퐓(ℎ|푣) is governed by the same type of distribution as the network itself, which in our case is the Boltzmann–Gibbs distribution.

Corollary 5.1:

퐐퐓(ℎ|푣) is proportional to the probability 퐏퐓(푣, ℎ) of the joint configuration 휎 = (푣, ℎ):

$$\mathbf{Q}_{\mathbf{T}}(h\mid v) = \alpha(v,\mathbf{T})\cdot\mathbf{P}_{\mathbf{T}}(v,h) \qquad (5.15)$$
where: σ = (v, h) is a configuration of the network; T is the pseudo–temperature; and α(v, T) is a positive constant depending only on v and T.

Proof: A consequence of Theorem 5.1 is that, in a Boltzmann machine with visible units 푣 clamped and with hidden units ℎ, 퐐퐓(ℎ|푣) is governed by the Boltzmann–Gibbs distribution given by the equation (2.5). The energy corresponding to 퐐퐓(ℎ|푣) according with the equation (2.5) can be obtained from the equation (5.14). Therefore, we can write:

$$\mathbf{Q}_{\mathbf{T}}(h\mid v) = \frac{1}{Z_{\mathcal{H}}}\exp\Bigl(\frac{-E_{\mathcal{H}}(h\mid v)}{\mathbf{T}}\Bigr) = \frac{1}{Z_{\mathcal{H}}}\exp\Bigl(\frac{-E(v,h) + E_{\mathcal{V}}(v)}{\mathbf{T}}\Bigr)$$

$$\mathbf{Q}_{\mathbf{T}}(h\mid v) = \frac{1}{Z_{\mathcal{H}}}\exp\Bigl(\frac{-E(v,h)}{\mathbf{T}}\Bigr)\exp\Bigl(\frac{E_{\mathcal{V}}(v)}{\mathbf{T}}\Bigr)$$
where Z_ℋ is an appropriate normalization constant for the distribution Q_T(h|v).

$$\mathbf{Q}_{\mathbf{T}}(h\mid v) = \Bigl(\frac{Z}{Z_{\mathcal{H}}}\exp\Bigl(\frac{E_{\mathcal{V}}(v)}{\mathbf{T}}\Bigr)\Bigr)\cdot\Bigl(\frac{1}{Z}\exp\Bigl(\frac{-E(v,h)}{\mathbf{T}}\Bigr)\Bigr)$$
where Z is the partition function for the true distribution P_T(v, h).

In the previous formula both Z and Z_ℋ are in essence constants, despite the fact that their computation is intractable. We also observe that the first factor–bracket depends only on v and T, which are both constant with respect to h. Moreover, the second factor–bracket is exactly P_T(v, h) given by the equation (4.40). If we denote the first factor–bracket by α(v, T), then we obtain the same expression for Q_T(h|v) as in the equation (5.15):

퐐퐓(ℎ|푣) = 훼(푣, 퐓) ∙ 퐏퐓(푣, ℎ)


The following theorem is essential for the Boltzmann machine learning algorithm. The theorem gives the relationship between 퐐퐓(푣, ℎ) and 퐏퐓(푣, ℎ) in terms of the observable probability 퐪(푣) and the marginal probability 퐩퐓(푣). It shows that, in the particular case of the Boltzmann–Gibbs distribution, this relationship has a simple ratio form.

There is a little bit of history around this theorem in the sense that, in the original derivation of the learning rule for the asynchronous Boltzmann machine, Ackley et al. assumed, without making direct appeal to the form of the underlying distribution, that at thermal equilibrium, the probability of a hidden state given a visible state is the same regardless how the visible units arrived there (clamped or free running) [11,43]. However, for systems with a distribution different from Boltzmann–Gibbs, like, for example, a synchronous Boltzmann machine, this theorem is false and the relationship is much more complicated [42,61]. This means that the classical arguments supporting Theorem 5.2 are logically inadequate, although the conclusion is correct [42]. The missing piece from the original proof was identified and the logic of the original derivation was clarified by Jones in [63].

Theorem 5.2 (Jones [63]):

If the true distribution 퐏퐓 over the whole network of 퐁퐌 is described by the Boltzmann–Gibbs distribution given by the equation (2.5), then the environmental distribution 퐐퐓 is given by:

$$\mathbf{Q}_{\mathbf{T}}(v,h) = \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\cdot\mathbf{P}_{\mathbf{T}}(v,h) \qquad (5.16)$$

Proof: We start from the equation (5.15) and we sum over all possible ℎ ∈ Iℋ. We also take into consideration the fact that 훼(푣, 퐓) is independent of ℎ.

퐐퐓(ℎ|푣) = 훼(푣, 퐓) ∙ 퐏퐓(푣, ℎ)

$$\sum_{h\in I^{\mathcal{H}}} \mathbf{Q}_{\mathbf{T}}(h\mid v) = \sum_{h\in I^{\mathcal{H}}} \alpha(v,\mathbf{T})\cdot\mathbf{P}_{\mathbf{T}}(v,h) = \alpha(v,\mathbf{T})\sum_{h\in I^{\mathcal{H}}} \mathbf{P}_{\mathbf{T}}(v,h)$$

We observe that the sum of probabilities on the left–side should be 1 and the sum of probabilities on the right side is exactly the marginal MARG(퐏퐓, 풱)(푣) = 퐩퐓(푣) of the true distribution over the states of the visible units. More, in Section 4.1, when constructing the Gibbs measure associated to a Hamiltonian, we assumed that the corresponding Gibbs measure is a nondegenerate probability distribution, so we can divide by it without restriction. These observations lead us to the following:


$$1 = \alpha(v,\mathbf{T})\cdot\mathbf{p}_{\mathbf{T}}(v) \;\Leftrightarrow\; \alpha(v,\mathbf{T}) = \frac{1}{\mathbf{p}_{\mathbf{T}}(v)}$$

Based on our definition of 퐐퐓(푣, ℎ) in Section 5.1, we can write:

$$\mathbf{Q}_{\mathbf{T}}(v,h) = \mathbf{Q}_{\mathbf{T}}(h\mid v)\cdot\mathbf{q}(v) \;\Leftrightarrow\; \mathbf{Q}_{\mathbf{T}}(h\mid v) = \frac{1}{\mathbf{q}(v)}\cdot\mathbf{Q}_{\mathbf{T}}(v,h)$$

If we substitute 훼(푣, 퐓) and 퐐퐓(ℎ|푣) in the equation (5.15), we obtain exactly the equation (5.16):

$$\frac{1}{\mathbf{q}(v)}\cdot\mathbf{Q}_{\mathbf{T}}(v,h) = \frac{1}{\mathbf{p}_{\mathbf{T}}(v)}\cdot\mathbf{P}_{\mathbf{T}}(v,h)$$

$$\mathbf{p}_{\mathbf{T}}(v)\cdot\mathbf{Q}_{\mathbf{T}}(v,h) = \mathbf{q}(v)\cdot\mathbf{P}_{\mathbf{T}}(v,h)$$

Lemma 5.3:

The partial derivatives of the KL–divergence between the observable probability 퐪(푣) and the marginal probability 퐩퐓(푣) with respect to the parameters W̅ of the network are computed with the following formulae:

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = -\sum_{(v,h)\in I^{\mathcal{N}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\cdot\frac{\partial\,\mathbf{P}_{\mathbf{T}}(v,h)}{\partial \hat{w}_{ij}} \qquad (5.17)$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \theta_i} = -\sum_{(v,h)\in I^{\mathcal{N}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\cdot\frac{\partial\,\mathbf{P}_{\mathbf{T}}(v,h)}{\partial \theta_i} \qquad (5.18)$$

Proof: Before we start the proof, we recall that, per Definition 5.2, the learning process computes the parameters W̄ of the network such that $\lim_{t\to\infty} \overline{W} = W$. This means that the partial derivatives of KL(q‖p_T) should be computed with respect to the parameters $\overline{W} = (\overline{\hat{W}}, \overline{\Theta})$. However, to keep the text as readable as possible, we are going to use the parameters W = (Ŵ, Θ) in our computation, but with the meaning of W̄. We start from the definition of the KL–divergence (equations (B29) and (B30) from Appendix B):

$$\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}}) = \sum_{v\in I^{\mathcal{V}}} \mathbf{q}(v)\,\ln\frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}$$

We compute the partial derivative of KL(퐪||퐩퐓) with respect to the weights 푤̂푖푗 of the network by taking into consideration that 퐪(푣) is an environmentally imposed probability distribution, so it doesn’t depend on the parameters of the network.


$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = \sum_{v\in I^{\mathcal{V}}} \frac{\partial\bigl(\mathbf{q}(v)\ln\frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\bigr)}{\partial \hat{w}_{ij}} = \sum_{v\in I^{\mathcal{V}}} \mathbf{q}(v)\,\frac{\partial\bigl(\ln\frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\bigr)}{\partial \hat{w}_{ij}} = \sum_{v\in I^{\mathcal{V}}} \mathbf{q}(v)\,\frac{\mathbf{p}_{\mathbf{T}}(v)}{\mathbf{q}(v)}\,\frac{\partial\bigl(\frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\bigr)}{\partial \hat{w}_{ij}}$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = \sum_{v\in I^{\mathcal{V}}} \mathbf{p}_{\mathbf{T}}(v)\,\mathbf{q}(v)\,\frac{\partial\bigl(\frac{1}{\mathbf{p}_{\mathbf{T}}(v)}\bigr)}{\partial \hat{w}_{ij}} = -\sum_{v\in I^{\mathcal{V}}} \mathbf{p}_{\mathbf{T}}(v)\,\mathbf{q}(v)\,\frac{1}{\mathbf{p}_{\mathbf{T}}(v)^2}\,\frac{\partial\,\mathbf{p}_{\mathbf{T}}(v)}{\partial \hat{w}_{ij}}$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = -\sum_{v\in I^{\mathcal{V}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\,\frac{\partial\,\mathbf{p}_{\mathbf{T}}(v)}{\partial \hat{w}_{ij}}$$

Furthermore, we substitute 퐩퐓(푣) with its expression given by the equation (5.8):

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = -\sum_{v\in I^{\mathcal{V}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\,\frac{\partial \sum_{h\in I^{\mathcal{H}}} \mathbf{P}_{\mathbf{T}}(v,h)}{\partial \hat{w}_{ij}} = -\sum_{v\in I^{\mathcal{V}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)} \sum_{h\in I^{\mathcal{H}}} \frac{\partial\,\mathbf{P}_{\mathbf{T}}(v,h)}{\partial \hat{w}_{ij}}$$

Earlier we proved that 푣 and ℎ are orthogonal (equation 5.7), which translates into the following:

$$\sum_{v\in I^{\mathcal{V}}} \sum_{h\in I^{\mathcal{H}}} = \sum_{(v,h)\in I^{\mathcal{N}}}$$

If we apply the orthogonality of 푣 and ℎ to the previous expression of the partial derivative of

KL(퐪||퐩퐓) with respect to the weights 푤̂푖푗, we obtain exactly the formula (5.17):

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = -\sum_{v\in I^{\mathcal{V}}} \sum_{h\in I^{\mathcal{H}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\,\frac{\partial\,\mathbf{P}_{\mathbf{T}}(v,h)}{\partial \hat{w}_{ij}} = -\sum_{(v,h)\in I^{\mathcal{N}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\,\frac{\partial\,\mathbf{P}_{\mathbf{T}}(v,h)}{\partial \hat{w}_{ij}}$$

The proof for the formula (5.18) is very similar to the proof for the formula (5.17), except that the weights ŵ_ij are replaced with the thresholds θ_i.

Before presenting the most important result of the learning rule derivation for the asynchronous symmetric Boltzmann machine, we introduce the expectations mentioned at the beginning of this section. Given a configuration σ = (v, h) of the network, we denote by q_ij the expectation with respect to the data distribution Q_T, i.e., the data probability, averaged over all environmental inputs and measured at equilibrium, that the i-th and the j-th units are both on. We denote by p_ij the expectation with respect to the true distribution P_T, i.e., the true probability, measured at equilibrium, that the i-th and the j-th units are both on.

$$q_{ij} = \sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{Q}_{\mathbf{T}}(\sigma) \quad\text{if } i\neq j \qquad (5.19)$$

$$p_{ij} = \sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{P}_{\mathbf{T}}(\sigma) \quad\text{if } i\neq j \qquad (5.20)$$

Similarly, we denote by q_i the data probability, averaged over all environmental inputs and measured at equilibrium, that the i-th unit is on, and by p_i the true probability, measured at equilibrium, that the i-th unit is on.

$$q_i = \sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\mathbf{Q}_{\mathbf{T}}(\sigma) \qquad (5.21)$$

$$p_i = \sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\mathbf{P}_{\mathbf{T}}(\sigma) \qquad (5.22)$$
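When these expectations are later estimated from samples rather than computed exactly, they reduce to empirical averages of σ_i·σ_j and σ_i over sampled configurations. A possible sketch (our own Python, with assumed names) is:

```python
import numpy as np

def pairwise_and_unit_stats(samples):
    """Estimate <sigma_i sigma_j> and <sigma_i> from an array of sampled 0/1 configurations.

    samples : array of shape (num_samples, n); rows are configurations sigma.
    Returns (S_ij, s_i), approximating (5.19)/(5.20) and (5.21)/(5.22) respectively,
    depending on whether the samples came from the clamped or the free-running chain.
    """
    samples = np.asarray(samples, dtype=float)
    second_moments = samples.T @ samples / len(samples)   # entry (i, j) is the mean of sigma_i * sigma_j
    first_moments = samples.mean(axis=0)
    return second_moments, first_moments

# q_ij, q_i would come from configurations collected with the data clamped;
# p_ij, p_i from configurations collected while the network runs freely.
clamped_samples = np.random.default_rng(0).integers(0, 2, size=(500, 5))
q_ij, q_i = pairwise_and_unit_stats(clamped_samples)
print(q_ij.shape, q_i.shape)   # (5, 5) (5,)
```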

Theorem 5.4 (Gradient–Descent for asynchronous symmetric Boltzmann machines):

The partial derivatives of the KL–divergence between the environmental probability 퐪(푣) and the marginal of the true probability 퐩퐓(푣) with respect to the symmetric weights 푤̂푖푗 respectively the thresholds 휃푖 are given by the following formulae:

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = -\frac{1}{\mathbf{T}}\,(q_{ij} - p_{ij}) \qquad (5.23)$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \theta_i} = \frac{1}{\mathbf{T}}\,(q_i - p_i) \qquad (5.24)$$

Proof: From Lemma 5.3 it is sufficient to determine ∂P_T(σ)/∂ŵ_ij and ∂P_T(σ)/∂θ_i. In order to compute these partial derivatives, we need to know the expression of the true probability distribution of the network at equilibrium. In Chapter 4 we learned that the joint distribution of a Boltzmann machine is a Boltzmann–Gibbs distribution and is given by the equation (4.40). However, we need to prove that the equation (4.40) also represents the equilibrium distribution of the Boltzmann machine, which we are going to do in Section 5.3.2. In the rest of this section we assume that the equilibrium distribution of the Boltzmann machine is given by the following version of the equation (4.40), which takes into consideration the pseudo–temperature T:

$$\mathbf{P}_{\mathbf{T}}(\sigma) = \mathbf{P}_{\mathbf{T}}(v,h) = \mathbf{P}_{\mathbf{T}}(v+h) = \frac{1}{Z}\exp\Bigl(\frac{-E(\sigma)}{\mathbf{T}}\Bigr) \qquad (5.25)$$

We observe that both the numerator and the denominator of 퐏(휎) depend on the weights 푤̂푖푗 respectively the thresholds 휃푖.


$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \hat{w}_{ij}} = \frac{\partial}{\partial \hat{w}_{ij}}\Bigl(\frac{1}{Z}\exp\Bigl(\frac{-E(\sigma)}{\mathbf{T}}\Bigr)\Bigr)$$

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \hat{w}_{ij}} = -\frac{1}{Z^2}\exp\Bigl(\frac{-E(\sigma)}{\mathbf{T}}\Bigr)\frac{\partial Z}{\partial \hat{w}_{ij}} - \frac{1}{Z\,\mathbf{T}}\exp\Bigl(\frac{-E(\sigma)}{\mathbf{T}}\Bigr)\frac{\partial E(\sigma)}{\partial \hat{w}_{ij}}$$

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \hat{w}_{ij}} = -\mathbf{P}_{\mathbf{T}}(\sigma)\Bigl(\frac{1}{Z}\frac{\partial Z}{\partial \hat{w}_{ij}} + \frac{1}{\mathbf{T}}\frac{\partial E(\sigma)}{\partial \hat{w}_{ij}}\Bigr) \qquad (5.26)$$

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \theta_i} = -\frac{1}{Z^2}\exp\Bigl(\frac{-E(\sigma)}{\mathbf{T}}\Bigr)\frac{\partial Z}{\partial \theta_i} - \frac{1}{Z\,\mathbf{T}}\exp\Bigl(\frac{-E(\sigma)}{\mathbf{T}}\Bigr)\frac{\partial E(\sigma)}{\partial \theta_i}$$

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \theta_i} = -\mathbf{P}_{\mathbf{T}}(\sigma)\Bigl(\frac{1}{Z}\frac{\partial Z}{\partial \theta_i} + \frac{1}{\mathbf{T}}\frac{\partial E(\sigma)}{\partial \theta_i}\Bigr) \qquad (5.27)$$

From the formulae (5.26) and (5.27) it is sufficient to determine ∂Z/∂ŵ_ij and ∂E(σ)/∂ŵ_ij, respectively ∂Z/∂θ_i and ∂E(σ)/∂θ_i. We start by recalling the expression of E(σ) given by the equation (4.37) and the expression of Z given by the equation (4.39):

$$E(\sigma) = -\sum_{\substack{\{i,j\}\in\mathcal{G}\\ i<j}} \sigma_i\,\sigma_j\,\hat{w}_{ij} + \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i$$

$$Z = \sum_{u\in I^{\mathcal{V}},\,g\in I^{\mathcal{H}}} \exp\Bigl(\frac{-E(u,g)}{\mathbf{T}}\Bigr) = \sum_{\sigma\in I^{\mathcal{N}}} \exp\Bigl(\frac{\sum_{\{i,j\}\in\mathcal{G},\, i<j} \sigma_i\,\sigma_j\,\hat{w}_{ij} - \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i}{\mathbf{T}}\Bigr)$$

The partial derivatives of 퐸(휎) with respect to the weights 푤̂푖푗 respectively the thresholds 휃푖 are:

$$\frac{\partial E(\sigma)}{\partial \hat{w}_{ij}} = -\sigma_i\,\sigma_j \qquad (5.28)$$

$$\frac{\partial E(\sigma)}{\partial \theta_i} = \sigma_i \qquad (5.29)$$

The partial derivatives of 푍 with respect to the weights 푤̂푖푗 respectively the thresholds 휃푖 are:

$$\frac{\partial Z}{\partial \hat{w}_{ij}} = \sum_{\sigma\in I^{\mathcal{N}}} \exp\Bigl(\frac{\sum_{\{i,j\}\in\mathcal{G},\,i<j} \sigma_i\,\sigma_j\,\hat{w}_{ij} - \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i}{\mathbf{T}}\Bigr)\,\frac{\sigma_i\,\sigma_j}{\mathbf{T}}$$

$$\frac{\partial Z}{\partial \hat{w}_{ij}} = \frac{Z}{\mathbf{T}} \sum_{\sigma=(u,g)\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\frac{\exp\bigl(\frac{-E(u,g)}{\mathbf{T}}\bigr)}{Z}$$

$$\frac{\partial Z}{\partial \hat{w}_{ij}} = \frac{Z}{\mathbf{T}} \sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{P}_{\mathbf{T}}(\sigma) \qquad (5.30)$$

respectively:

$$\frac{\partial Z}{\partial \theta_i} = \sum_{\sigma\in I^{\mathcal{N}}} \exp\Bigl(\frac{\sum_{\{i,j\}\in\mathcal{G},\,i<j} \sigma_i\,\sigma_j\,\hat{w}_{ij} - \sum_{i\in\mathcal{N}} \sigma_i\,\theta_i}{\mathbf{T}}\Bigr)\,\frac{-\sigma_i}{\mathbf{T}}$$

$$\frac{\partial Z}{\partial \theta_i} = -\frac{Z}{\mathbf{T}} \sum_{\sigma=(u,g)\in I^{\mathcal{N}}} \sigma_i\,\frac{\exp\bigl(\frac{-E(u,g)}{\mathbf{T}}\bigr)}{Z}$$

$$\frac{\partial Z}{\partial \theta_i} = -\frac{Z}{\mathbf{T}} \sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\mathbf{P}_{\mathbf{T}}(\sigma) \qquad (5.31)$$

We substitute the formulae (5.28) and (5.30) into the formula (5.26), respectively the formulae (5.29) and (5.31) into the formula (5.27).

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \hat{w}_{ij}} = -\mathbf{P}_{\mathbf{T}}(\sigma)\Bigl(\frac{1}{Z}\,\frac{Z}{\mathbf{T}}\sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{P}_{\mathbf{T}}(\sigma) + \frac{1}{\mathbf{T}}(-\sigma_i\,\sigma_j)\Bigr)$$

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \hat{w}_{ij}} = -\frac{\mathbf{P}_{\mathbf{T}}(\sigma)}{\mathbf{T}}\Bigl(\sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{P}_{\mathbf{T}}(\sigma) - \sigma_i\,\sigma_j\Bigr) \qquad (5.32)$$

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \theta_i} = -\mathbf{P}_{\mathbf{T}}(\sigma)\Bigl(-\frac{1}{Z}\,\frac{Z}{\mathbf{T}}\sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\mathbf{P}_{\mathbf{T}}(\sigma) + \frac{1}{\mathbf{T}}\,\sigma_i\Bigr)$$

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \theta_i} = \frac{\mathbf{P}_{\mathbf{T}}(\sigma)}{\mathbf{T}}\Bigl(\sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\mathbf{P}_{\mathbf{T}}(\sigma) - \sigma_i\Bigr) \qquad (5.33)$$

We observe that in the formula (5.32) the first term inside the bracket is exactly 푝푖푗 given by the formula (5.20). Similarly, in the formula (5.33) the first term inside the bracket is exactly 푝푖 given by the formula (5.22).

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \hat{w}_{ij}} = -\frac{\mathbf{P}_{\mathbf{T}}(\sigma)}{\mathbf{T}}\,(p_{ij} - \sigma_i\,\sigma_j) \qquad (5.34)$$

$$\frac{\partial\,\mathbf{P}_{\mathbf{T}}(\sigma)}{\partial \theta_i} = \frac{\mathbf{P}_{\mathbf{T}}(\sigma)}{\mathbf{T}}\,(p_i - \sigma_i) \qquad (5.35)$$

Furthermore, we substitute the formulae (5.34) and (5.35) into the formulae (5.17) respectively (5.18).

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = \sum_{(v,h)\in I^{\mathcal{N}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\,\frac{\mathbf{P}_{\mathbf{T}}(v,h)}{\mathbf{T}}\,(p_{ij} - \sigma_i\,\sigma_j) \qquad (5.36)$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \theta_i} = -\sum_{(v,h)\in I^{\mathcal{N}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\,\frac{\mathbf{P}_{\mathbf{T}}(\sigma)}{\mathbf{T}}\,(p_i - \sigma_i) \qquad (5.37)$$

We use the formula (5.16) to substitute in the formula (5.36) the expression $\frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\cdot\mathbf{P}_{\mathbf{T}}(v,h)$ with Q_T(v, h). Thus, the formula we obtain is exactly the formula (5.23).

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = \sum_{(v,h)\in I^{\mathcal{N}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\,\frac{\mathbf{P}_{\mathbf{T}}(v,h)}{\mathbf{T}}\,(p_{ij} - \sigma_i\,\sigma_j) = \frac{1}{\mathbf{T}}\sum_{(v,h)\in I^{\mathcal{N}}} \mathbf{Q}_{\mathbf{T}}(v,h)\,(p_{ij} - \sigma_i\,\sigma_j)$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = \frac{1}{\mathbf{T}}\Bigl(\sum_{\sigma\in I^{\mathcal{N}}} p_{ij}\,\mathbf{Q}_{\mathbf{T}}(\sigma) - \sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{Q}_{\mathbf{T}}(\sigma)\Bigr)$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \hat{w}_{ij}} = \frac{1}{\mathbf{T}}\Bigl(p_{ij}\sum_{\sigma\in I^{\mathcal{N}}} \mathbf{Q}_{\mathbf{T}}(\sigma) - q_{ij}\Bigr) = -\frac{1}{\mathbf{T}}\,(q_{ij} - p_{ij})$$

Finally, we use the formula (5.16) to substitute in the formula (5.37) the expression $\frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\cdot\mathbf{P}_{\mathbf{T}}(v,h)$ with Q_T(v, h). Thus, the formula we obtain is exactly the formula (5.24).

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \theta_i} = -\sum_{(v,h)\in I^{\mathcal{N}}} \frac{\mathbf{q}(v)}{\mathbf{p}_{\mathbf{T}}(v)}\,\frac{\mathbf{P}_{\mathbf{T}}(\sigma)}{\mathbf{T}}\,(p_i - \sigma_i) = -\frac{1}{\mathbf{T}}\sum_{(v,h)\in I^{\mathcal{N}}} \mathbf{Q}_{\mathbf{T}}(v,h)\,(p_i - \sigma_i)$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \theta_i} = \frac{1}{\mathbf{T}}\Bigl(\sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\mathbf{Q}_{\mathbf{T}}(\sigma) - \sum_{\sigma\in I^{\mathcal{N}}} p_i\,\mathbf{Q}_{\mathbf{T}}(\sigma)\Bigr)$$

$$\frac{\partial\,\mathrm{KL}(\mathbf{q}\,\|\,\mathbf{p}_{\mathbf{T}})}{\partial \theta_i} = \frac{1}{\mathbf{T}}\Bigl(q_i - p_i\sum_{\sigma\in I^{\mathcal{N}}} \mathbf{Q}_{\mathbf{T}}(\sigma)\Bigr) = \frac{1}{\mathbf{T}}\,(q_i - p_i)$$

With respect to Theorem 5.4 we remark that the formulae (5.23) and (5.24) show that the process of reaching thermal equilibrium ensures that the joint activity of any two units contains


all the information required for changing the weight between them in order to give the network a better model of its environment [43]. Specifically, the joint activity of any two units encodes information explicitly about those units and encodes information implicitly about all the other weights in the network [43]. The formulae (5.23) and (5.24) also show that the joint activity of any two units doesn’t depend on what kind of units they are: both visible, both hidden, or one visible and one hidden.

In practice, to minimize KL(q‖p_T), it is sufficient to observe the statistics q_ij, p_ij, q_i, and p_i at thermal equilibrium and to change each weight and each threshold with the formulae:

$$\Delta\hat{w}_{ij} = \delta\cdot(q_{ij} - p_{ij}) \qquad (5.38)$$

$$\Delta\theta_i = -\delta\cdot(q_i - p_i) \qquad (5.39)$$
where δ is a constant learning rate. These updates move in the direction opposite to the gradients (5.23) and (5.24), so each step decreases the KL–divergence.

Another possibility is to incorporate the annealing process into the learning rate as follows:

$$\delta = \frac{1}{\mathbf{T}} \qquad (5.40)$$
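Putting the pieces together, one parameter update that descends the gradients (5.23)–(5.24) might be sketched as follows (hypothetical Python; the helper name and the diagonal handling are our own assumptions):

```python
import numpy as np

def kl_gradient_step(W, theta, q_ij, p_ij, q_i, p_i, T=1.0, delta=0.1):
    """One gradient-descent step on KL(q || p_T) using (5.23)-(5.24).

    dKL/dw_ij    = -(q_ij - p_ij) / T   -> weights move by   +delta * (q_ij - p_ij) / T
    dKL/dtheta_i = +(q_i  - p_i ) / T   -> thresholds move by -delta * (q_i  - p_i ) / T
    """
    dKL_dW = -(q_ij - p_ij) / T
    dKL_dtheta = (q_i - p_i) / T
    W_new = W - delta * dKL_dW
    np.fill_diagonal(W_new, 0.0)          # keep the no-self-connection convention
    theta_new = theta - delta * dKL_dtheta
    return W_new, theta_new

if __name__ == "__main__":
    n = 3
    W0, th0 = np.zeros((n, n)), np.zeros(n)
    stats = (np.full((n, n), 0.2), np.full((n, n), 0.1), np.full(n, 0.6), np.full(n, 0.5))
    print(kl_gradient_step(W0, th0, *stats))
```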

We end this section with a high–level pseudocode of the generic learning algorithm. We present this algorithm by emphasizing the aspects related to the learning rule we derived. At this time we do not go into details regarding the update process (specifically how the units are selected for update) and the collection of statistics. However, we mention that, for a given pattern, the selection of a unit to be updated is in principle similar with the selection performed in a Hopfield network (Section 2.5.1), otherwise it could be a stochastic process (taking place at a given mean rate 푟 > 0 for each unit) or a deterministic process (being part of a predefined sequence). In Section 5.3.3 we will present various strategies to update the units as well as to collect the statistics 푞푖푗 and 푝푖푗.

Algorithm 5.1: Generic Boltzmann Machine Learning

Given: n x n weight matrix Ŵ ; n x 1 threshold vector Θ

a training set of pa data vectors: {v^(k)}_{1≤k≤pa}

the number of learning cycles: 푒푝

begin


Step 1: initialize the weights Ŵ and the thresholds Θ

For an arbitrary number of learning cycles:

Step 2: for e=1 to 푒푝 do

For each one of the patterns to be learned:

Step 3: for k =1 to 푝푎 do

Clamping phase:

Step 4: present and clamp the pattern 푣(푘)

UPDATE PROCESS START

Randomly pick a hidden unit to update its value:

Step 5: choose at random a hidden unit ℎ푖 from the set ℋ

Lower pseudo–temperature following a schedule:

Step 6: anneal ℎ푖

At the final pseudo–temperature estimate the correlations:

Step 7: collect statistics 푞푖푗

UPDATE PROCESS END

Free–running phase:

Step 8: present the pattern 푣(푘) but do not clamp it

UPDATE PROCESS START

Randomly pick a visible or hidden unit to update its value:

Step 9: choose at random a unit 휎푖 from the set 𝒩

Lower pseudo–temperature following a schedule:

Step 10: anneal 휎푖

At the final pseudo–temperature estimate the correlations:

Step 11: collect statistics 푝푖푗


UPDATE PROCESS END

Update the weights and the thresholds for any pair of

connected units 푖 ≠ 푗 such that at least one unit has been

updated:

Step 12: Δŵ_ij = δ ∙ (q_ij − p_ij) for i ≠ j

푤̂푖푗 ← 푤̂푖푗 + Δ푤̂푖푗

Δθ_i = −δ ∙ (q_i − p_i)

휃푖 ← 휃푖 + Δ휃푖

end for //k

end for //e

end

return W
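To show how the steps of Algorithm 5.1 could fit together in practice, here is a compact, hypothetical Python sketch (Gibbs-style unit updates at a fixed pseudo-temperature instead of a full annealing schedule, single-sample statistics, and our own helper names; it is an illustration, not the thesis' implementation):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(sigma, W, theta, T, units, rng):
    """Resample each unit in `units` from P(sigma_i = 1 | sigma_-i) = sigm((W sigma - theta)_i / T)."""
    for i in units:
        p = sigm((W[i] @ sigma - theta[i]) / T)
        sigma[i] = 1.0 if rng.random() < p else 0.0
    return sigma

def train(data, n_hidden, epochs=50, sweeps=20, T=1.0, delta=0.05, seed=0):
    """A very small stand-in for Algorithm 5.1 on 0/1 data vectors (the rows of `data`)."""
    rng = np.random.default_rng(seed)
    m = data.shape[1]
    n = m + n_hidden
    W, theta = np.zeros((n, n)), np.zeros(n)
    hidden, all_units = range(m, n), range(n)
    for _ in range(epochs):
        for v in data:
            # positive (clamped) phase: visible units fixed to the pattern, hidden units evolve
            sigma = np.concatenate([v, rng.integers(0, 2, n_hidden)]).astype(float)
            for _ in range(sweeps):
                gibbs_sweep(sigma, W, theta, T, hidden, rng)
            q_ij, q_i = np.outer(sigma, sigma), sigma.copy()
            # negative (free-running) phase: all units evolve
            sigma = rng.integers(0, 2, n).astype(float)
            for _ in range(sweeps):
                gibbs_sweep(sigma, W, theta, T, all_units, rng)
            p_ij, p_i = np.outer(sigma, sigma), sigma.copy()
            # descend the KL gradient using (5.23)-(5.24): single-sample, hence noisy, estimates
            W += delta * (q_ij - p_ij)
            np.fill_diagonal(W, 0.0)
            theta -= delta * (q_i - p_i)
    return W, theta

if __name__ == "__main__":
    patterns = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
    W, theta = train(patterns, n_hidden=2)
    print(W.shape, theta.shape)
```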

The generic Boltzmann machine learning algorithm runs slowly, partly because of the time required to reach thermal equilibrium and partly because the learning is driven by the difference between two noisy variables, so these variables must be sampled for a long time at thermal equilibrium to reduce the noise [64]. If we could achieve the same simple relationships between log probabilities and weights in a deterministic system, the learning process would be much faster. We will explore this idea in Section 5.5.

5.3.2 Collecting the statistics required for learning

In the previous section we saw that the generic learning algorithm computes the difference between two expectations or statistics denoted by us as 푞푖푗 (equation (5.19)) and 푝푖푗 (equation (5.20)). To evaluate the complexity of the exact computation of these expectations, we rewrite the equation (5.19) by substituting 퐐퐓(푣, ℎ) with its definition given in Section 5.1:


$$q_{ij} = \sum_{\sigma\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{Q}_{\mathbf{T}}(\sigma) = \sum_{\sigma=(v,h)\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{Q}_{\mathbf{T}}(v,h)$$

$$q_{ij} = \sum_{\sigma=(v,h)\in I^{\mathcal{N}}} \sigma_i\,\sigma_j\,\mathbf{Q}_{\mathbf{T}}(h\mid v)\,\mathbf{q}(v) \qquad (5.41)$$

In the equation (5.41) 퐪(푣) can be approximated by its empirical distribution 퐪̂(푣) whose computation is tractable:

$$\mathbf{q}(v) \cong \hat{\mathbf{q}}(v) = \frac{1}{pa}\sum_{k=1}^{pa} \prod_{u=1}^{m} \mathbb{I}_{u;\,v_u^{(k)}}(v_u) \qquad (5.42)$$

where: pa is the number of data vectors (patterns); {v^(k)}_{1≤k≤pa} is the training set of data vectors; 𝕀_{u; v_u^(k)}(v_u) is the indicator function given by the equation (3.7); and m is the number of visible units (equation (4.25)).
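One possible way to compute this empirical approximation (a small sketch with assumed names) is simply to count how often each distinct visible vector occurs in the training set:

```python
from collections import Counter

def empirical_distribution(patterns):
    """q_hat(v): relative frequency of each visible vector in the training set, as in (5.42)."""
    counts = Counter(tuple(v) for v in patterns)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

patterns = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1)]
print(empirical_distribution(patterns))   # {(1, 1, 0): 0.5, (0, 0, 1): 0.25, (1, 0, 1): 0.25}
```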

A simplified analysis of Algorithm 5.1 shows that the complexity of computing 푞푖푗 depends on the complexity of computing 퐐퐓(ℎ|푣), which is exponential in the number of hidden units 푙

(equation (4.25)). Also the complexity of computing 푝푖푗 depends on the complexity of computing

퐏퐓(푣, ℎ) which is exponential in the total number of units (visible and hidden) 푛 = 푚 + 푙.

Consequently, both computations (푞푖푗 and 푝푖푗) are intractable. Later in this section we will see why the analysis of Algorithm 5.1 is actually much more complicated.

Therefore, in order to compute the parameters of the network, an approximation of the environmental distribution and estimations of both the environmental distribution and the true distribution are necessary. We have already seen how the approximation of the environmental distribution is performed (equation (5.42)). Now we concentrate our attention on the estimation tasks. Both estimation tasks can be accomplished by using the MCMC framework. In essence, a MCMC algorithm performs sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution. Then, after a number of steps, the algorithm uses the state of the chain as a sample of the desired distribution.

In this section we present three categories of sampling algorithms that are used to estimate the data–dependent statistics 푞푖푗 and/or the data–independent statistics 푝푖푗 in a Boltzmann machine: Gibbs sampling, persistent Markov chains, and contrastive divergence.


5.3.2.1 Gibbs sampling

Hinton and Sejnowski used a MCMC sampling approach for estimating the data–dependent statistics 푞푖푗 by clamping a training vector on the visible units, initializing the hidden units to random binary states, and using sequential Gibbs sampling of the hidden units to approach the posterior distribution [11,43]. They estimated the data–independent statistics 푝푖푗 in the same way, but with the randomly initialized visible units included in the sequential Gibbs sampling [11,43].

In Gibbs sampling, each variable draws a sample from its posterior distribution given the current states of the other variables [65]. Before explaining how Gibbs sampling actually works, we recall the notation X−i and its meaning as the set of all the random variables from X except 푋푖 (3.73). Given a joint probability distribution 퐏 of 푛 random variables X = (푋1, 푋2, … , 푋푛), Gibbs sampling of 퐏 is done through a sequence of 푛 sampling sub–steps of the following form, such that the new value for 푋푖 is used straight away in subsequent sampling steps:

X_i \sim \mathbf{P}(X_i \mid \mathbf{X}_{-i} = x_{-i}) \quad \text{for } 1 \le i \le n, \qquad \text{or equivalently:} \qquad X_i^{(t+1)} = \begin{cases} 1, & \text{if } u \le \mathbf{P}(X_i = 1 \mid \mathbf{X}_{-i} = x_{-i}^{(t)}) \\ 0, & \text{otherwise} \end{cases}    (5.43)

where 푥−푖 represents the evidence of the corresponding random variables X−i and 푢 is a sample from a uniform distribution 풰[0,1]. After these 푛 samples have been obtained, a step of the chain is completed, yielding a sample of X whose distribution converges to 퐏(X) as the number of steps goes to ∞, under some conditions. A sufficient condition for convergence of a finite–state Markov chain is that it is aperiodic and irreducible (see Theorem C.7 in Appendix C).

Let us consider a configuration 휎 = (푣, ℎ) of 퐁퐌 and denote by 휎−i the set of values associated with all units except the ith unit. In order to perform Gibbs sampling in 퐁퐌, we need to compute and sample from 퐏(휎푖|휎−i) as follows:

\mathbf{P}(\sigma) = \mathbf{P}(\sigma_i, \sigma_{-i}) = \frac{\exp(-E(\sigma_i, \sigma_{-i}))}{Z}

\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{\exp(-E(\sigma_i = 1, \sigma_{-i}))}{\exp(-E(\sigma_i = 1, \sigma_{-i})) + \exp(-E(\sigma_i = 0, \sigma_{-i}))}

\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \dfrac{\exp(-E(\sigma_i = 0, \sigma_{-i}))}{\exp(-E(\sigma_i = 1, \sigma_{-i}))}}

\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \exp\big(E(\sigma_i = 1, \sigma_{-i}) - E(\sigma_i = 0, \sigma_{-i})\big)}

\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \exp\big(-\sum_{j \in \mathcal{N}} \sigma_j \cdot \hat{w}_{ji} + \theta_i\big)}

\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \mathrm{sigm}\Big(\sum_{j \in \mathcal{N}} \sigma_j \cdot \hat{w}_{ji} - \theta_i\Big)    (5.44)

Our task is to use formula (5.44) to sample, in the positive phase, ℎ from 퐐퐓(ℎ|푣) and to sample, in the negative phase, both 푣 and ℎ from 퐏퐓(푣, ℎ), i.e., from 퐏퐓(푣|ℎ) and 퐏퐓(ℎ|푣).

Therefore, in the positive phase we run a Markov chain for 퐐퐓 and sample ℎ according to the following formula:

\mathbf{Q_T}(h_i = 1 \mid v, h_{-i}) = \mathrm{sigm}\Big(\sum_{j \in \mathcal{V}} v_j \cdot \hat{w}_{ji} + \sum_{m \in \mathcal{H} - \{i\}} h_m \cdot \hat{w}_{mi} - \theta_i\Big)    (5.45)

In the negative phase we run a Markov chain for 퐏퐓 and sample ℎ and 푣 according to the following formulae:

\mathbf{P_T}(h_i = 1 \mid v, h_{-i}) = \mathrm{sigm}\Big(\sum_{j \in \mathcal{V}} v_j \cdot \hat{w}_{ji} + \sum_{m \in \mathcal{H} - \{i\}} h_m \cdot \hat{w}_{mi} - \theta_i\Big)

\mathbf{P_T}(v_i = 1 \mid v_{-i}, h) = \mathrm{sigm}\Big(\sum_{j \in \mathcal{H}} h_j \cdot \hat{w}_{ji} + \sum_{k \in \mathcal{V} - \{i\}} v_k \cdot \hat{w}_{ki} - \theta_i\Big)    (5.46)

For any iteration of learning, two separate Markov chains are run for every data vector: one chain is used to estimate 푞푖푗 and another is run to estimate 푝푖푗. This makes the algorithm computationally expensive because, before taking samples, we must wait until each Markov chain reaches its stationary distribution, and this can require a very large number of steps, with no foolproof method known for determining whether equilibrium has been reached. A further disadvantage is the large variance of the estimated gradient: in general, the samples from the stationary distribution have high variance since they come from all over the model’s distribution [37].
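A minimal sketch of one sweep of the Gibbs updates (5.45)–(5.46) is given below. It assumes a symmetric weight matrix with zero diagonal over all 푛 = 푚 + 푙 units and a threshold vector; the function and variable names are hypothetical, not taken from the thesis.

import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(state, W, theta, clamped=()):
    """One full sweep of sequential Gibbs sampling (equations (5.45)-(5.46)).

    state   : length-n 0/1 numpy vector over visible and hidden units
    W       : symmetric n x n weight matrix with zero diagonal
    theta   : length-n threshold vector
    clamped : indices of units held fixed (the visible units in the positive phase)
    """
    rng = np.random.default_rng()
    for i in range(len(state)):
        if i in clamped:
            continue
        p_on = sigm(W[i] @ state - theta[i])    # P(sigma_i = 1 | sigma_{-i}), eq. (5.44)
        state[i] = 1 if rng.random() < p_on else 0
    return state

In the positive phase the visible indices are passed as clamped, so only the hidden units are resampled; in the negative phase nothing is clamped and the same sweep is applied to all units.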

The Markov chains used in the negative phase of Gibbs sampling do not depend on the training data; therefore, they do not have to be restarted for each new data vector 푣. This observation has been exploited in persistent MCMC estimators [39, 66].


5.3.2.2 Using persistent Markov chains to estimate the data–independent statistics

Neal in [14] and Tieleman in [66] proposed a different way to estimate the data–independent statistics 푝푖푗: a stochastic approximation procedure (SAP). SAP belongs to the class of stochastic approximation algorithms of the Robbins–Monro type [67-69,39].

To understand how SAP works, we assume that, for any 푖 ≠ 푗, the data dependent statistics 푞푖푗 are available to us at any time and that we maintain a set of 푀 sample points {휎(1)(푡), 휎(2)(푡), … , 휎(푀)(푡)}.

We augment our notation to include the parameters W of the network in the specification of the joint probability distribution 퐏퐓, as well as in the expression of the data–independent statistics 푝푖푗 (equation (5.20)):

p_{ij}(\mathrm{W}) = \sum_{\sigma \in I^{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{P_T}(\sigma; \mathrm{W})    (5.47)

Next, we consider a 퐓퐕퐁퐌 = (𝒩, 𝒢, W) given by Definition 4.3 and let W(푡) be the parameters of the 퐓퐕퐁퐌 at the moment 푡 ∈ 풯 and 휎(푡) be the configuration of the 퐓퐕퐁퐌 at the same moment of time. Then, W(푡) and 휎(푡) are updated sequentially as follows:

• Given 휎(푡 − 1), a new state 휎(푡) is sampled from a transition operator 푇W(휎(푡 − 1) → 휎(푡)) that leaves 퐏퐓(W) invariant.

• Having W(푡 − 1) and 휎(푡), the data–independent statistics 푝푖푗 are updated according to the formulae (5.46) to reflect the changes that affected 휎 (from 푡 − 1 to 푡).

• Based on the new value of 푝푖푗, a new parameter W(푡) is obtained with the formula (5.38).

The transition operator 푇W(휎(푡 − 1) → 휎(푡)) is defined by the blocked Gibbs updates given by the formulae (5.46). Precise sufficient conditions that guarantee almost sure convergence to an asymptotically stable point are given in [68-70]. One necessary condition requires the learning rate to decrease with time, i.e.:

\sum_{t=0}^{\infty} \delta_t = \infty \qquad \text{and} \qquad \sum_{t=0}^{\infty} \delta_t^2 < \infty    (5.48)

This condition can be trivially satisfied by setting 훿푡 = 1/푡, since the harmonic series diverges while the series of squares converges. Typically, in practice, the sequence {|W(푡)|}푡∈풯 is bounded, and the Markov chain governed by the transition operator 푇W is ergodic.


Together with the condition on the learning rate, this ensures almost sure convergence [39]. The pseudocode of SAP is presented below.

Algorithm 5.2: Stochastic Approximation

Given: n x n weight matrix Ŵ ; n x 1 threshold vector Θ

all possible 푞푖푗 for any 푖 ≠ 푗

begin

Step 1: initialize W(0) and 푀 fantasy particles: {휎(1)(0), … , 휎(푀)(0)}

Step 2: for t=0 to 풯 do

Step 3: for 푘=1 to 푀 do

Step 4: sample 휎^(푘)(푡) given 휎^(푘)(푡 − 1) using the transition operator 푇W(휎^(푘)(푡 − 1) → 휎^(푘)(푡))

end for //k

Step 5: estimate 푝푖푗 from the fantasy particles {휎^(1)(푡), … , 휎^(푀)(푡)} and update W(푡) = W(푡 − 1) + 훿푡 ∙ (푞푖푗 − 푝푖푗)

Step 6: decrease 훿푡

end for //t

end

The intuition behind why this procedure works is the following: as the learning rate becomes sufficiently small compared with the mixing rate of the Markov chain, this “persistent” chain will always stay very close to the stationary distribution even if it is only run for a few MCMC updates per parameter update. Samples from the persistent chain will be highly correlated for successive parameter updates, but again, if the learning rate is sufficiently small the chain will mix before the parameters have changed enough to significantly alter the value of the estimator [39]. Many persistent chains can be run in parallel. The current state of each of these chains is usually denoted as a “fantasy particle” [39].
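A compact illustration of Algorithm 5.2 is sketched below. It assumes that the data–dependent statistics 푞푖푗 are given as a matrix and that the negative statistics are obtained by averaging 휎푖휎푗 over the fantasy particles after each Gibbs sweep; the sign of the weight update follows Step 5 above, and all names are illustrative rather than the thesis’ own code.

import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_statistics(W, theta, particles, rng):
    """Advance each fantasy particle by one sweep of Gibbs updates and return
    the estimated data-independent statistics p_ij (Algorithm 5.2, Steps 3-5)."""
    for sigma in particles:                       # each particle is a 0/1 vector of length n
        for i in range(len(sigma)):
            p_on = sigm(W[i] @ sigma - theta[i])
            sigma[i] = 1 if rng.random() < p_on else 0
    S = np.array(particles, dtype=float)
    return (S.T @ S) / len(particles)             # average of sigma_i * sigma_j

def sap_learning(W, theta, q, steps=100, delta0=0.1, M=10, rng=None):
    """Stochastic approximation with persistent chains; q holds the data-dependent
    statistics q_ij, assumed given as in Algorithm 5.2."""
    rng = rng or np.random.default_rng(0)
    n = W.shape[0]
    particles = [rng.integers(0, 2, size=n) for _ in range(M)]   # fantasy particles
    for t in range(1, steps + 1):
        p = negative_statistics(W, theta, particles, rng)
        delta_t = delta0 / t                      # decreasing learning rate, eq. (5.48)
        W += delta_t * (q - p)
        np.fill_diagonal(W, 0.0)                  # keep the no-self-connection convention
    return W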

One important observation is that the process of running persistent Markov chains to produce the data–independent statistics 푝푖푗 is interleaved with the learning process itself. Consequently, the analysis of Algorithm 5.1 becomes more difficult because it cannot be viewed anymore as an outer loop for the inner loop represented by the statistics gathering.

5.3.2.3 Contrastive divergence (CD)

In [37] Hinton proposed a simple and effective alternative to maximum likelihood (ML) learning that eliminates almost all of the computation required to get samples from the equilibrium distribution and also eliminates much of the variance that masks the gradient signal. This method, named contrastive divergence, follows the gradient of a different function than ML learning.

ML learning follows the log likelihood gradient by minimizing the KL–divergence:

\mathrm{KL}(\mathbf{q} \,\|\, \mathbf{p_T}) \equiv \mathrm{KL}(\mathbf{q}^0 \,\|\, \mathbf{p}_{\mathbf{T}_\infty}) = \sum_{v \in I^{\mathcal{V}}} \mathbf{q}^0(v) \cdot \ln \frac{\mathbf{q}^0(v)}{\mathbf{p}_{\mathbf{T}_\infty}(v; \mathrm{W})}    (5.49)

CD learning approximately follows the gradient of the difference of two divergences [37]:

\mathrm{CD}_k = \mathrm{KL}(\mathbf{q}^0 \,\|\, \mathbf{p}_{\mathbf{T}_\infty}) - \mathrm{KL}(\mathbf{q}^k \,\|\, \mathbf{p}_{\mathbf{T}_\infty})    (5.50)

where: 퐪ퟎ is the data distribution over the visible vectors (the “zero–step” reconstructions); 퐪풌 is the distribution over the “k–step” reconstructions of the data vectors generated by 푘 > 0 full steps of Gibbs sampling; and 퐩퐓∞ is the equilibrium distribution of the network. In particular, 퐪풌 could simply be 퐪ퟏ.

The CD algorithm is fueled by the contrast between the statistics collected when the input is a real training example and when the input is a chain sample [71]. The intuitive motivation behind CD is that we would like the Markov chain that is implemented by Gibbs sampling to leave the initial distribution 퐪ퟎ over the visible variables unaltered.

In CD learning, we start the Markov chain at the data distribution 퐪ퟎ and run the chain for only a small number of steps. Instead of updating the parameters only after the chain has been run all the way to equilibrium, we run it for just 푘 full steps and update the parameters immediately, avoiding the expensive equilibration that maximum likelihood learning would require before every update. The parameters are thereby adjusted so as to reduce the tendency of the chain to wander away from the initial distribution during its first 푘 steps.

Because 퐪풌 is 푘 steps closer to the equilibrium distribution than 퐪ퟎ, we are guaranteed that KL(퐪ퟎ||퐩퐓∞) exceeds KL(퐪풌||퐩퐓∞) unless 퐪ퟎ equals 퐪풌. Consequently, CD (equation (5.50)) can never be negative. Also, for Markov chains in which all transitions have nonzero probability, 퐪ퟎ = 퐪풌 implies 퐪ퟎ = 퐩퐓∞, because, if the distribution does not change at all on the first step, it must already be at equilibrium, so CD can be zero only if the model is perfect.

In [72] Carreira–Perpinan and Hinton showed that, in general, CD provides biased estimates. However, the bias seems to be small, as their experiments comparing CD and ML showed. They also showed that, for almost all data distributions, the fixed–points of CD are not fixed–points of ML, and vice versa. Finally, they proposed a new and more effective approach to collect statistics for Boltzmann machine learning: use CD to perform most of the learning, followed by a short run of ML to clean up the solution.

CD learning is well–suited for Restricted Boltzmann Machines due to the fact that they are the only Boltzmann machines with tractable conditional distributions 퐏퐓(ℎ|푣) and 퐐퐓(ℎ|푣).
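For concreteness, the sketch below implements one CD–k parameter update for a Restricted Boltzmann Machine, written in the usual bias convention (biases added to the net input rather than thresholds subtracted). It is only an illustration of how the positive and negative statistics in equation (5.50) are collected, not code from the thesis.

import numpy as np

rng = np.random.default_rng(0)
sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v0, k=1, delta=0.05):
    """One CD-k update for an RBM.

    W : m x l visible-to-hidden weights; b : visible biases (m); c : hidden biases (l)
    v0: a batch of training vectors (batch x m).  Returns updated (W, b, c).
    Positive statistics use the data; negative statistics use the k-step reconstructions."""
    ph0 = sigm(v0 @ W + c)                         # hidden probabilities given the data
    v, ph = v0, ph0
    for _ in range(k):                             # k full steps of Gibbs sampling
        h = (rng.random(ph.shape) < ph).astype(float)
        v = (rng.random(v0.shape) < sigm(h @ W.T + b)).astype(float)
        ph = sigm(v @ W + c)
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - v.T @ ph) / batch           # positive minus negative statistics
    db = (v0 - v).mean(axis=0)
    dc = (ph0 - ph).mean(axis=0)
    return W + delta * dW, b + delta * db, c + delta * dc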

5.4 The equilibrium distribution of a Boltzmann machine

To develop an understanding of Boltzmann machines it is necessary to be able to determine the equilibrium distribution. In general, the equilibrium distribution of a stochastic process is related to the structure of the associated transition probability matrix. If the transition probabilities are known, then it becomes possible to compute the equilibrium distribution. In an asynchronous Boltzmann machine with 푛 units the transitions between configurations, or so–called state transitions, generate a finite Markov chain whose state space contains 2^푛 global states and whose transitions between global states are performed such that only one unit may change its state at a time, according to the update rule (4.34) or its temporal version (4.44). A consequence of these update rules is the implicit satisfaction of the Markov local property for the associated Markov chain. Moreover, the transition probability matrix of the associated Markov chain can be computed from the parameters of the model.

In this section we are interested in establishing the existence of a unique equilibrium distribution for a Boltzmann machine when 퐓 > 0 and in determining the behavior of a Boltzmann machine when 퐓 = 0. In presenting the chain of logical arguments that relate Markov processes and Boltzmann machines we have been inspired by the work of Viveros [42].

In order to construct the 2^푛 × 2^푛 transition probability matrix of the Markov chain associated with a Boltzmann machine, we need to define the transition probability 퐏퐓(휎(푡) | 휎(푡 − 1)) of a global state transition 휎(푡 − 1) → 휎(푡). The transition probability matrix 퐏퐓(휎(푡) | 휎(푡 − 1)) is a rather sparse matrix. If the updating sequence is random, the matrix will have 푛 + 1 non–zero entries per row. One of the non–zero entries is on the diagonal and represents the probability that the particular unit selected for updating does not change its state, that is, that the configuration 휎 remains unchanged. Each of the other non–zero entries represents the probability that the corresponding unit will change its state according to the update rule (4.34) or its temporal version (4.44); that is, they correspond to the possible transitions to one of the 푛 states that differ from 휎 by a change in the state of a single unit [42].

However, if the updating sequence is more orderly and each unit has a predetermined turn to update, the number of non–zero entries per row decreases further. For example, one way to update a layered Boltzmann machine is layer–by–layer and sequentially inside a layer. Thus, in each row of the transition probability matrix there will be exactly two components that are not zero: one to the “left” and one to the “right” of the unit to be updated.

Therefore, different updating regimes lead to different transition probability matrices and consequently to different dynamics. Before we define the transition probability matrix 퐏퐓(휎(푡) | 휎(푡 − 1)), we recall its source of inspiration, namely the transition probability for a single unit in one time step given by the formula (4.44). The formula (4.44) also shows that, in one time step, every transition from one configuration to another configuration that differs from the first configuration in at most one position has non–zero probability.

Definition 5.3:

The transition probability 퐏퐓(휎(푡) | 휎(푡 − 1)) of a global state transition 휎(푡 − 1) → 휎(푡) in an asynchronous Boltzmann machine is given by the formula (5.51):

\mathbf{P_T}(\sigma(t) \mid \sigma(t-1)) = \begin{cases} \dfrac{1}{1 + \exp\!\big(-\frac{(2\sigma_i(t) - 1) \cdot \mathrm{net}_t(i, \sigma(t-1))}{\mathbf{T}}\big)} & \text{if } \exists \text{ exactly one } i \text{ such that } \sigma_i(t) \neq \sigma_i(t-1) \\[2ex] 1 - \sum_{\rho(t) \neq \sigma(t)} \mathbf{P_T}(\rho(t) \mid \sigma(t-1)) & \text{if } \sigma_i(t-1) = \sigma_i(t) \text{ for all } i \\[1ex] 0 & \text{otherwise} \end{cases}    (5.51)

We could obtain a more readable form of the formula (5.51) if we use the following notations:

휎(푡) ≡ 휎

휎(푡 − 1) ≡ 휏

Then the transition matrix 퐏퐓(휎(푡) | 휎(푡 − 1)) given by (5.51) becomes 퐏퐓(휎|휏) given by (5.52):

\mathbf{P_T}(\sigma \mid \tau) = \begin{cases} \dfrac{1}{1 + \exp\!\big(-\frac{(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau)}{\mathbf{T}}\big)} & \text{if } \exists \text{ exactly one } i \text{ such that } \sigma_i \neq \tau_i \\[2ex] 1 - \sum_{\rho \neq \sigma} \mathbf{P_T}(\rho \mid \tau) & \text{if } \sigma_i = \tau_i \text{ for all } i \\[1ex] 0 & \text{otherwise} \end{cases}    (5.52)
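For a small network, the transition probability matrix (5.52) can be built explicitly by enumerating all 2^푛 configurations. The sketch below assumes uniform unit selection g(i) = 1/n and net(푖, 휏) = ∑_j ŵ_{ji} τ_j − θ_i, which is consistent with (5.44); the helper name is hypothetical.

import itertools
import numpy as np

def transition_matrix(W, theta, T=1.0):
    """Build the 2^n x 2^n transition matrix of equation (5.52) for a small
    asynchronous Boltzmann machine (a sketch, not the thesis' own code)."""
    n = W.shape[0]
    configs = [np.array(c) for c in itertools.product([0, 1], repeat=n)]
    index = {tuple(c): k for k, c in enumerate(configs)}
    P = np.zeros((2 ** n, 2 ** n))
    for a, tau in enumerate(configs):
        for i in range(n):                          # flip exactly one unit
            sigma = tau.copy()
            sigma[i] = 1 - sigma[i]
            net = W[i] @ tau - theta[i]             # assumed net-input convention
            flip = 1.0 / (1.0 + np.exp(-(2 * sigma[i] - 1) * net / T))
            P[a, index[tuple(sigma)]] = flip / n    # uniform selection probability 1/n
        P[a, a] = 1.0 - P[a].sum()                  # remain in the same configuration
    return P

Each row then has at most 푛 + 1 non–zero entries, exactly as discussed above for a random updating sequence.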

In order to establish the existence or non–existence of an equilibrium distribution of a Markov chain in terms of the properties of its associated transition probability matrix, we take into consideration the fact that the transition probability matrix is stochastic, that is, a special case of a non–negative matrix. Furthermore, the Perron–Frobenius theorem [73-74] details the precise range of possibilities for the eigenvalues and eigenvectors of non–negative, irreducible matrices. This result is of particular importance for us because the properties of the eigenvalues of the transition probability matrix are the only factors that influence the nature and existence of an equilibrium distribution for a Markov process [42].

We introduce the concepts of aperiodicity and irreducibility. A finite Markov chain is aperiodic if no state of it is periodic with period 푘 > 1; a state has period 푘 if one can only return to it at times 푡 + 푘, 푡 + 2푘, etc. A finite Markov chain is irreducible if one can reach any state from any state in finite time with non–zero probability. We provide without proof the following theorem that asserts the existence and unicity of the equilibrium distribution for a category of transition probability matrices that includes, as we shall see, the transition probability matrices associated to asynchronous Boltzmann machines.

Theorem 5.5 (Existence of a unique equilibrium distribution for stochastic, irreducible, aperiodic matrices)

Let 퐏 be a 푑 × 푑 stochastic, irreducible, and aperiodic matrix. Then the equilibrium distribution for the associated Markov process exists and is given by the left eigenvector corresponding to the largest eigenvalue 휆; moreover, 휆 = 1 since 퐏 is stochastic.

The proof of this theorem can be found in [42]. In order to establish the existence of a unique equilibrium distribution for an asynchronous Boltzmann machine, we are looking at the irreducibility and aperiodicity properties of the transition probability matrix. Two cases are in essence considered: when the pseudo–temperature is strictly positive and when the pseudo–temperature is zero. Moreover, within the case 퐓 > 0, the scenario 퐓 → 0 demands special consideration. Therefore, we treat it separately. Table 2 gives a classification of all possible transition probability matrices for asynchronous Boltzmann machines and how they position themselves with respect to irreducibility and aperiodicity.

Table 2: Transition probability matrices for asynchronous symmetric Boltzmann machines

    퐓 > 0:  irreducible and aperiodic              →  equilibrium distribution exists
    퐓 = 0:  irreducible and periodic, or reducible  →  no equilibrium distribution

 퐓 > 0

We want to prove the existence of an equilibrium distribution for asynchronous symmetric Boltzmann machines. To accomplish this, first we prove a few helper lemmas and theorems, then we prove the most significant result of this section: Theorem 5.10.

Lemma 5.6:

Given an asynchronous Boltzmann machine with 푛 units, any state can be visited from any other state with positive probability in at most 푛 steps.

Proof: Each configuration of a Boltzmann machine with 푛 units can be viewed as a vector of length 푛, each component being the state of a unit. In any asynchronous Boltzmann machine one unit updates at each time step and, since 퐓 > 0, the update is described by the equation (4.44). Moreover, a unit has a positive probability of changing its state.

A consequence of the update rule is that the Hamming distance between two successive configurations is no greater than 1. Thus, any two configurations are at most Hamming distance 푛 apart, so in the worst case 푛 units must change state. Since each of these single–unit changes has positive probability, the whole sequence of at most 푛 updates occurs with positive probability [42].

Theorem 5.7:

The transition probability matrix (5.51) – (5.52) of an asynchronous Boltzmann machine is irreducible.

124

Proof: Let 퐏 = (p_{ij})_{1 \le i, j \le 2^n} be the 2^푛 × 2^푛 transition probability matrix of 퐏퐓(휎(푡) | 휎(푡 − 1)) given by (5.51). From Lemma 5.6 we have that any configuration has a positive probability of being visited from any other configuration in at most 푛 steps. For every configuration 휎 ∈ I^𝒩 we denote by 푢 the index of the row corresponding to 휎 in 퐏. Therefore, according to the definition of an irreducible matrix, for every pair of configurations 휎, 휏 ∈ I^𝒩 identified by their row indices 푢 and 푣 respectively, there exists an integer 푟 ≥ 1 (by Lemma 5.6, 푟 ≤ 푛 suffices) such that p^{(r)}_{uv} > 0. Hence the transition probability matrix 퐏퐓 defined by (5.51) is an irreducible matrix [42].

Theorem 5.8:

The transition probability matrix (5.51) – (5.52) of an asynchronous Boltzmann machine is aperiodic.

Proof: For any given configuration 휎 ∈ I^𝒩 we shall prove that 퐏퐓(휎|휎) > 0. Given any configuration 휎 there are 푛 + 1 possible transitions, one of them being to remain in the current configuration. The 푛 other possible transitions lead to configurations at Hamming distance 1 from 휎. Let 휏 be one of these 푛 configurations. The probability that the ith unit of the configuration 휎 outputs 휏푖 is given by the equation (4.44):

\mathbf{P}(\tau_i \mid \sigma) = \frac{1}{1 + \exp\!\big(-\frac{(2\tau_i - 1) \cdot \mathrm{net}(i, \sigma)}{\mathbf{T}}\big)}

When 퐓 > 0, we have 0 < 퐏(휏푖|휎) < 1 for all these 푛 configurations. For any other configuration 휌, that is, one whose Hamming distance to 휎 is greater than 1, the transition probability is zero.

Now we need to take into consideration another probability, which we have ignored so far because it has no direct effect on learning: the probability that the ith unit is selected for update. Usually this is a uniform distribution over the set of units, which means that the probability that the ith unit is selected for update is 1/푛. This probability distribution is imposed by the environment, so it is independent of 퐏.

With these observations, we can compute 퐏(휎|휎):

\mathbf{P}(\sigma \mid \sigma) = 1 - \sum_{\tau \in I^{\mathcal{N}} - \{\sigma\},\; i \in \mathcal{N}} \frac{1}{n} \cdot \mathbf{P}(\tau_i \mid \sigma) = 1 - \frac{1}{n} \cdot \sum_{\tau \in I^{\mathcal{N}} - \{\sigma\},\; i \in \mathcal{N}} \mathbf{P}(\tau_i \mid \sigma) > 0

Hence, for all 휎 ∈ I^𝒩 we have that 퐏(휎|휎) > 0. According to Lemma C.1 in Appendix C, we conclude that the transition probability matrix defined by (5.51) – (5.52) is aperiodic.

We remark that in this proof we have also proved that the transition probability matrix given by (5.51) – (5.52) is reflexive [42].

Theorem 5.9:

For the asynchronous transition probability matrix defined by (5.51) – (5.52) the equilibrium distribution exists and is given by the left eigenvector corresponding to the eigenvalue 휆 = 1.

Proof: When 퐓 > 0 the transition probability matrix given by (5.51) – (5.52) is stochastic. From Theorem 5.7 and Theorem 5.8 we know that this transition probability matrix is also irreducible and aperiodic. Therefore, from Theorem 5.5 there exists a unique equilibrium distribution given by the left eigenvector of the matrix given by (5.51) – (5.52) corresponding to the eigenvalue 휆 = 1 [42].

We have proved that in general for any weight matrix the equilibrium distribution of an asynchronous Boltzmann machine with 퐓 > 0 is given by the left eigenvector of the transition matrix (5.51) – (5.52) corresponding to the eigenvalue 휆 = 1. If the system is allowed to stabilize at a given pseudo–temperature, we have proved the existence of an equilibrium distribution. If we slowly lower the pseudo–temperature, allowing the system to restabilize at the new equilibrium distribution, then as 퐓 → 0 the distribution will tend to a uniform distribution over the optimal set of configurations 휎, which we call 푂푝푡. Roughly speaking, the asynchronous Boltzmann machine converges asymptotically to the set of globally optimal states 푂푝푡 ⊆ I𝒩 that minimize the energy function given by (4.31) [42]. The following theorem states these facts more formally.

Theorem 5.10 (Asynchronous weight–symmetric equilibrium distribution):

Let the transition probabilities in a Boltzmann machine be given by (5.51) – (5.52). Then:

 There exists a unique equilibrium distribution 퐏퐓(휎) for all 퐓 > 0 whose components are given by:


\mathbf{P_T}(\sigma) = \lim_{k \to \infty} \mathbf{P_T}(\sigma(k) = \sigma) = \frac{\exp\!\big(-\frac{E(\sigma)}{\mathbf{T}}\big)}{Z(\mathbf{T})}, \qquad \text{where: } \; Z(\mathbf{T}) = \sum_{\tau \in I^{\mathcal{N}}} \exp\!\Big(-\frac{E(\tau)}{\mathbf{T}}\Big)    (5.53)

 As 퐓 → 0 the stationary distribution converges to a uniform distribution over the set of optimal states, i.e.:

\lim_{\mathbf{T} \to 0} \Big( \lim_{k \to \infty} \mathbf{P_T}(\sigma(k) = \sigma) \Big) = \frac{1}{|Opt|} \cdot \mathbb{I}_{Opt}(\sigma)    (5.54)

where 핀푂푝푡 is the characteristic function of 푂푝푡, i.e. 핀푂푝푡 takes the value one for any 휏 ∈ 푂푝푡 and zero elsewhere.

The first part asserts that the equilibrium distribution at any pseudo–temperature 퐓 is the Boltzmann–Gibbs distribution. The second part implies that as 퐓 → 0 the equilibrium distribution tends to a uniform distribution over the set 푂푝푡 of minimal energy states 휎.

Proof: From Theorem 5.9 the transition probability matrix defined by (5.51) – (5.52) has a unique equilibrium distribution given by its left eigenvector corresponding to the eigenvalue 휆 = 1.

Our approach is to use Proposition C.5 from Appendix C to prove that 퐏퐓(휎) is the equilibrium distribution. Specifically, if for all 휎, 휏 ∈ I^𝒩 there exist numbers 퐏퐓(휎), 퐏퐓(휏) such that:

\mathbf{P_T}(\tau \mid \sigma) \cdot \mathbf{P_T}(\sigma) = \mathbf{P_T}(\sigma \mid \tau) \cdot \mathbf{P_T}(\tau)    (5.55)

then 퐏퐓(휎) is the equilibrium distribution. We have to show that 퐏퐓(휎) given by (5.53), together with 퐏퐓(휎|휏) given by (5.52), satisfies (5.55). We distinguish two cases:

1. if 휎 = 휏, then (5.55) is satisfied by the definitions (5.52) and (5.53).

2. if 휎 ≠ 휏, then we have:

\mathbf{P_T}(\sigma) \cdot \mathbf{P_T}(\tau \mid \sigma) = \mathbf{P_T}(\sigma) \cdot \mathbf{P}(\tau_i \mid \sigma)

\mathbf{P_T}(\sigma) \cdot \mathbf{P_T}(\tau \mid \sigma) = \frac{\exp\!\big(-\frac{E(\sigma)}{\mathbf{T}}\big)}{Z(\mathbf{T})} \cdot \frac{1}{1 + \exp\!\big(-\frac{(2\tau_i - 1) \cdot \mathrm{net}(i, \sigma)}{\mathbf{T}}\big)}    (5.56)

By multiplying the right hand side of (5.56) by:

\exp\!\Big(\frac{(2\tau_i - 1) \cdot \mathrm{net}(i, \sigma)}{\mathbf{T}}\Big) \cdot \exp\!\Big(-\frac{(2\tau_i - 1) \cdot \mathrm{net}(i, \sigma)}{\mathbf{T}}\Big) = 1

we obtain:

\mathbf{P_T}(\sigma) \cdot \mathbf{P_T}(\tau \mid \sigma) = \frac{\exp\!\big(-\frac{E(\sigma)}{\mathbf{T}}\big)}{Z(\mathbf{T})} \cdot \frac{\exp\!\big(\frac{(2\tau_i - 1) \cdot \mathrm{net}(i, \sigma)}{\mathbf{T}}\big)}{1 + \exp\!\big(\frac{(2\tau_i - 1) \cdot \mathrm{net}(i, \sigma)}{\mathbf{T}}\big)}    (5.57)

The asynchronous Boltzmann machine is symmetric, so the equation (4.51) holds:

net(푖, 휎) ∙ Δ휎푖 = net(푖, 휎) ∙ (휏푖 − 휎푖) = −Δ퐸푖 = 퐸(휎) − 퐸(휏)

The equations (4.47) and (4.49) also hold for 휎 and 휏 because their Hamming distance is 1.

net(푖, 휏) = net(푖, 휎)

휏푖 = 1 − 휎푖

We rewrite (5.57) by substituting 휏푖 as specified above and using the equality between the net inputs net(푖, 휏) and net(푖, 휎). We obtain:

\mathbf{P_T}(\sigma) \cdot \mathbf{P_T}(\tau \mid \sigma) = \frac{\exp\!\big(-\frac{E(\tau) + \mathrm{net}(i, \sigma) \cdot \Delta\sigma_i}{\mathbf{T}}\big)}{Z(\mathbf{T})} \cdot \frac{\exp\!\big(\frac{(-2\sigma_i + 1) \cdot \mathrm{net}(i, \tau)}{\mathbf{T}}\big)}{1 + \exp\!\big(\frac{(-2\sigma_i + 1) \cdot \mathrm{net}(i, \tau)}{\mathbf{T}}\big)}

\mathbf{P_T}(\sigma) \cdot \mathbf{P_T}(\tau \mid \sigma) = \frac{\exp\!\big(-\frac{E(\tau)}{\mathbf{T}}\big) \cdot \exp\!\big(-\frac{\mathrm{net}(i, \sigma) \cdot (\tau_i - \sigma_i)}{\mathbf{T}}\big)}{Z(\mathbf{T})} \cdot \frac{\exp\!\big(-\frac{(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau)}{\mathbf{T}}\big)}{1 + \exp\!\big(-\frac{(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau)}{\mathbf{T}}\big)}

\mathbf{P_T}(\sigma) \cdot \mathbf{P_T}(\tau \mid \sigma) = \frac{\exp\!\big(-\frac{E(\tau)}{\mathbf{T}}\big)}{Z(\mathbf{T})} \cdot \frac{1}{1 + \exp\!\big(-\frac{(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau)}{\mathbf{T}}\big)} \cdot \exp\!\Big(-\frac{\mathrm{net}(i, \sigma) \cdot (\tau_i + \sigma_i - 1)}{\mathbf{T}}\Big)

퐏퐓(휎) ∙ 퐏퐓(휏|휎) = 퐏퐓(휏) ∙ 퐏퐓(휎|휏) ∙ exp(0) = 퐏퐓(휏) ∙ 퐏퐓(휎|휏)

Consequently, according with Proposition C.5 from Appendix C, 퐏퐓(휎) is the equilibrium distribution.

We omit the details of the proof for the second part of Theorem 5.10. However, as mentioned by Viveros [42], apart from a change of notation, the proof is exactly Theorem 8.1 p.134 and Corollary 2.1 p. 18 of [75].


The hypothesis of symmetric weights in a Boltzmann machine simplifies the case 퐓 > 0 a lot because it enables one to infer that the detailed balance condition holds, and this leads immediately to simple closed formulae for the equilibrium distributions [42].
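Theorem 5.10 and the detailed balance argument above can be checked numerically on a small network. The sketch below assumes the energy convention E(σ) = −∑_{i<j} ŵ_{ij} σ_i σ_j + ∑_i θ_i σ_i implied by (5.61), and it drops the unit–selection factor 1/n, which appears on both sides of (5.55) and cancels; all names are illustrative.

import itertools
import numpy as np

rng = np.random.default_rng(1)
n, T = 4, 1.5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)  # symmetric weights
theta = rng.normal(size=n)

def energy(s):
    # E(sigma) = -sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i  (assumed convention)
    return -0.5 * s @ W @ s + theta @ s

def flip_prob(sigma, tau, i):
    # single-unit transition probability tau -> sigma from equation (5.52), without 1/n
    net = W[i] @ tau - theta[i]
    return 1.0 / (1.0 + np.exp(-(2 * sigma[i] - 1) * net / T))

configs = [np.array(c, dtype=float) for c in itertools.product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(s) / T) for s in configs)

worst = 0.0
for tau in configs:
    for i in range(n):
        sigma = tau.copy(); sigma[i] = 1 - sigma[i]
        lhs = np.exp(-energy(tau) / T) / Z * flip_prob(sigma, tau, i)    # P_T(tau) P_T(sigma|tau)
        rhs = np.exp(-energy(sigma) / T) / Z * flip_prob(tau, sigma, i)  # P_T(sigma) P_T(tau|sigma)
        worst = max(worst, abs(lhs - rhs))
print("largest detailed-balance violation:", worst)   # numerically zero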

 퐓 → 0

In the limiting case when 퐓 → 0 the transition probability matrix for asynchronous Boltzmann machines is given by taking the limit when 퐓 → 0 of the equation (5.52).

\mathbf{P_T}(\sigma \mid \tau) = \begin{cases} \lim_{\mathbf{T} \to 0} \dfrac{g(i)}{1 + \exp\!\big(-\frac{(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau)}{\mathbf{T}}\big)} & \text{if } \exists \text{ exactly one } i \text{ such that } \sigma_i \neq \tau_i \\[2ex] 1 - \lim_{\mathbf{T} \to 0} \sum_{\rho \neq \sigma} \mathbf{P_T}(\rho \mid \tau) & \text{if } \sigma_i = \tau_i \text{ for all } i \\[1ex] 0 & \text{otherwise} \end{cases}    (5.58)

where 푔(푖) denotes the environmental distribution used for the selection of the unit 푖 to be updated. Usually 푔 is a uniform distribution, so the probability that the ith unit is selected for update is 1/푛.

The limiting behavior of the transition probability matrix as 퐓 → 0 is determined by the limiting transition probability for a particular unit 푖:

\lim_{\mathbf{T} \to 0} \frac{1}{1 + \exp\!\big(-\frac{(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau)}{\mathbf{T}}\big)} = \begin{cases} 0 & \text{if } \Delta E = -(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau) > 0 \\[1ex] \frac{1}{2} & \text{if } \Delta E = -(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau) = 0 \\[1ex] 1 & \text{if } \Delta E = -(2\sigma_i - 1) \cdot \mathrm{net}(i, \tau) < 0 \end{cases}    (5.59)

The equation (5.59) gives the limiting transition probability 휏푖 → 휎푖 as 퐓 → 0 for a particular unit 푖 in terms of its activation or net input. We learned that, in principle, there are 푛 + 1 non–zero entries in any row of the 2^푛 × 2^푛 transition probability matrix. But is it possible that some of these entries become zero? If that happens, what effect does it have on the transition probability matrix as 퐓 → 0? To answer these questions, suppose that a configuration 휏 has 푘 units with zero activation, where 1 ≤ 푘 ≤ 푛:

∃ 푖(1), 푖(2), … , 푖(푘) such that net(푖(푗), 휏) = 0 for any 1 ≤ 푗 ≤ 푘 (5.60)

In the transition 휏 → 휎, the units 휎푖(1), … , 휎푖(푘) can take values 0 or 1 with equal probability. Because of the zero activation of these units, the limit (5.59) corresponding to them is always 1/2, so the corresponding matrix entries (which also include the selection probability 1/푛) are 1/(2푛). Thus, as 퐓 → 0, in the row corresponding to 휏 in the 2^푛 × 2^푛 transition probability matrix there will be 푘 ≤ 푛 non–zero entries having the value 1/(2푛), and the remainder of the entries must sum to 1 − 푘/(2푛) [42].

 퐓 = 0

The case 퐓 = 0 is a special limiting case in which the dynamics become essentially deterministic and the model approaches the Hopfield model. Provided that only one unit updates at a time, the network settles into a local energy minimum, just as a Hopfield network does (see Section 2.5.2). Here there might be cycles or fixed–points but, in the strict sense, an equilibrium distribution does not exist. Since the networks are deterministic, their “transition probability matrices” are not transition probability matrices in the strict sense. However, these “transition probability matrices” can be either reducible or irreducible and periodic [42].
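In this limit the stochastic rule collapses to the deterministic descent sketched below, essentially a Hopfield update with ties broken by a fair coin, as prescribed by (5.59). The sketch uses the same assumed net(푖, 휎) convention as before and is only an illustration, not the thesis’ own formulation.

import numpy as np

def zero_temperature_step(sigma, W, theta, i, rng):
    """Update unit i of configuration sigma in the T -> 0 limit (equation (5.59)):
    adopt the state that lowers the energy; toss a fair coin when net(i, sigma) = 0."""
    net = W[i] @ sigma - theta[i]
    if net > 0:
        sigma[i] = 1                    # turning the unit on lowers the energy
    elif net < 0:
        sigma[i] = 0                    # turning the unit off lowers the energy
    else:
        sigma[i] = rng.integers(0, 2)   # tie: either state with probability 1/2
    return sigma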

5.5 Learning algorithms based on variational approaches

In Section 5.3 we learned how to find the parameters in the Boltzmann machine learning problem by means of maximum likelihood estimation. In this section we use a different, variational approach, in which we pick a tractable variational parameterization for the Boltzmann machine. The true probability distribution of a Boltzmann machine cannot be computed exactly, regardless of the form in which it is expressed: joint, conditional, or marginal. The goal of variational learning in a Boltzmann machine is to approximate the true conditional probability of the hidden variables given the data vectors over the visible variables, and to use this approximation in the learning rules (5.38) and (5.39) to replace the data–dependent statistics.

Here we recall a consequence of Theorem 5.1, specifically the equation (5.12): in a Boltzmann machine with visible units 푣 clamped and with hidden units ℎ, the subnet ℋ itself behaves like a Boltzmann machine with its own interconnecting weights Ŵ and thresholds (훉퐢)푖∈ℋ. In other words, the only effect of 푣 on ℎ is to cause the hidden units ℎ to run with effective thresholds 훉퐣 given by the equation (5.10) instead of their regular thresholds 휃푗. Therefore, the true conditional probability distribution 퐏퐓(ℎ|푣) is governed by the following Boltzmann–Gibbs distribution:

\mathbf{P_T}(h \mid v) = \mathbf{P}_{\mathcal{H}}(h \mid v) = \frac{\exp(-E_{\mathcal{H}}(h \mid v))}{Z_{\mathcal{H}}}

\mathbf{P_T}(h \mid v) = \frac{\exp\!\Big(\sum_{j \in \mathcal{H}} h_j \cdot \sum_{i \in \mathcal{H},\, i < j} h_i \cdot \hat{w}_{ij} - \sum_{j \in \mathcal{H}} \boldsymbol{\theta}_j \cdot h_j\Big)}{Z_{\mathcal{H}}}    (5.61)

where:

Z_{\mathcal{H}} = \sum_{h \in I^{\mathcal{H}}} \exp\!\Big(\sum_{j \in \mathcal{H}} h_j \cdot \sum_{i \in \mathcal{H},\, i < j} h_i \cdot \hat{w}_{ij} - \sum_{j \in \mathcal{H}} \boldsymbol{\theta}_j \cdot h_j\Big)    (5.62)

We augment the notations of the true probability distribution 퐏퐓 to include the parameters of the network. In this section we are specifically interested in the parameters corresponding to a mean parameterization of a pairwise Markov network (equations (3.11), (3.12), and (3.15)):

퐏퐓(푣, ℎ) = 퐏퐓(푣, ℎ; μ) and 퐏퐓(ℎ|푣) = 퐏퐓(ℎ|푣; μ) (5.63)

5.5.1 Using variational free energies to compute the statistics required for learning

In this section we establish the connection between the Boltzmann machine variational learning and the approximations of the free energies discussed in Chapter 3.

As mentioned in the previous section, variational learning is concerned with the true conditional distribution 퐏퐓(ℎ|푣). Concretely, variational learning means that we have to choose a conditional probability distribution 퐐(ℎ|푣) from a family ℚ(ℎ|푣; 휆) of approximating conditional probability distributions that are described by the variational parameters 휆. Generally, the Markov network representing 퐐 is not the same as the Markov network representing 퐏퐓 but rather a sub–graph of it.

From the family of approximating distributions ℚ(ℎ|푣; 휆), we choose a particular distribution 퐐 by minimizing the KL–divergence KL(ℚ||퐏퐓) given by the equation (3.25) with respect to the variational parameters 휆. Then, the particular distribution 퐐(ℎ|푣; 휆∗) that corresponds to the values 휆∗ of the variational parameters that resulted from the KL–divergence minimization is considered the best approximation of 퐏퐓(ℎ|푣) in the family ℚ(ℎ|푣; 휆).

\mathbf{Q} = \mathbb{Q}(h \mid v; \lambda^*), \qquad \text{where:} \qquad \lambda^* = \arg\min_{\lambda} \mathrm{KL}\big(\mathbb{Q}(h \mid v; \lambda) \,\|\, \mathbf{P_T}(h \mid v)\big)    (5.64)

One simple justification for using the KL–divergence as a measure of approximation accuracy is that it yields the best lower bound on the probability of the evidence 퐩퐓(푣) in the family of approximations 퐐(ℎ|푣; 휆). The probability of the evidence is the same as the probability distribution over the visible units.

To prove this claim, we first recall a form of Jensen’s inequality used in the context of probability theory: if X is a random variable and φ is a convex function, then:

φ(퐄[X]) ≤ 퐄[φ(X)] (5.65)

Thus, if we bound the log likelihood, i.e., the logarithm of the probability of the evidence, using Jensen’s inequality (applied here to the concave function ln, for which the inequality is reversed), we obtain:

\ln \mathbf{p_T}(v) = \ln \mathrm{MARG}(\mathbf{P_T}, \mathcal{V}) = \ln \sum_{h \in I^{\mathcal{H}}} \mathbf{P_T}(v, h) = \ln \sum_{h \in I^{\mathcal{H}}} \mathbf{Q}(h \mid v) \cdot \frac{\mathbf{P_T}(v, h)}{\mathbf{Q}(h \mid v)}

\ln \sum_{h \in I^{\mathcal{H}}} \mathbf{Q}(h \mid v) \cdot \frac{\mathbf{P_T}(v, h)}{\mathbf{Q}(h \mid v)} \;\ge\; \sum_{h \in I^{\mathcal{H}}} \mathbf{Q}(h \mid v) \cdot \ln \frac{\mathbf{P_T}(v, h)}{\mathbf{Q}(h \mid v)}

\ln \mathbf{p_T}(v) \;\ge\; \sum_{h \in I^{\mathcal{H}}} \mathbf{Q}(h \mid v) \cdot \ln \frac{\mathbf{P_T}(v, h)}{\mathbf{Q}(h \mid v)}    (5.66)

The inequality (5.66) can be interpreted as follows: its right–hand side is a lower bound on its left–hand side, which means that we have found a lower bound for ln 퐩퐓(푣). Moreover, the difference between the left–hand side and the right–hand side of (5.66) is exactly the KL–divergence KL(퐐(ℎ|푣) || 퐏퐓(ℎ|푣)) [27]:

\mathrm{KL}(\mathbf{Q}(h \mid v) \,\|\, \mathbf{P_T}(h \mid v)) = \ln \mathbf{p_T}(v) - \sum_{h \in I^{\mathcal{H}}} \mathbf{Q}(h \mid v) \cdot \ln \frac{\mathbf{P_T}(v, h)}{\mathbf{Q}(h \mid v)} \;\ge\; 0    (5.67)

Hence, by choosing 휆^* according to (5.64), we obtain the tightest lower bound for ln 퐩퐓(푣) [27]:

\mathrm{KL}(\mathbf{Q}(h \mid v; \lambda^*) \,\|\, \mathbf{P_T}(h \mid v)) = \ln \mathbf{p_T}(v) - \sum_{h \in I^{\mathcal{H}}} \mathbf{Q}(h \mid v; \lambda^*) \cdot \ln \frac{\mathbf{P_T}(v, h)}{\mathbf{Q}(h \mid v; \lambda^*)}

\ln \mathbf{p_T}(v) \;\ge\; \sum_{h \in I^{\mathcal{H}}} \mathbf{Q}(h \mid v; \lambda^*) \cdot \ln \frac{\mathbf{P_T}(v, h)}{\mathbf{Q}(h \mid v; \lambda^*)}    (5.68)

Furthermore, Theorem 3.1 taught us that the KL–divergence of two probability distributions 퐐 and 퐏 is related to the variational free energy of 퐐 and to the energy functional 퐹[퐏̃, 퐐]:

\mathrm{KL}(\mathbf{Q}(h \mid v) \,\|\, \mathbf{P_T}(h \mid v)) = -F[\tilde{\mathbf{P}}(h \mid v), \mathbf{Q}(h \mid v)] + \ln Z(\mathbf{P_T}(h \mid v))

\mathrm{KL}(\mathbf{Q}(h \mid v) \,\|\, \mathbf{P_T}(h \mid v)) = F[\mathbf{Q}(h \mid v)] + \ln Z(\mathbf{P_T}(h \mid v))    (5.69)

where: 퐹[퐏̃, 퐐] is the energy functional of 퐏(ℎ|푣) and 퐐(ℎ|푣); 퐹[퐐] is the variational free energy of 퐐(ℎ|푣); and 푍(퐏퐓(ℎ|푣)) is the partition function of the conditional of the true probability distribution 퐏퐓(ℎ|푣).

Using (5.69), the KL–divergence employed by (5.64) can be written as:

KL(ℚ(ℎ|푣; 휆)||퐏퐓(ℎ|푣)) = 퐹[ℚ(ℎ|푣; 휆)] + ln 푍(퐏퐓(ℎ|푣)) (5.70)

Using (5.70) and the fact that the true probability distribution 퐏퐓 does not depend on the variational parameter 휆, the optimization problem (5.64) can be reformulated as:

\mathbf{Q} = \mathbb{Q}(h \mid v; \lambda^*), \qquad \text{where:} \qquad \lambda^* = \arg\min_{\lambda} \big\{ F[\mathbf{Q}(h \mid v; \lambda)] + \ln Z(\mathbf{P_T}(h \mid v)) \big\} = \arg\min_{\lambda} F[\mathbf{Q}(h \mid v; \lambda)]    (5.71)

The optimization problem (5.71) shows the connection between variational Boltzmann machine learning and variational free energies and, at the same time, suggests a path to follow in a learning algorithm. When the mean field free energy 퐹푀퐹[퐐] plays the role of the free energy 퐹[퐐] in (5.70), the learning algorithm is called naïve mean field learning. When the Bethe–Gibbs free energy 퐺훽[퐐] plays the role of the free energy 퐹[퐐] in (5.70), the learning algorithm is called belief optimization learning.

Variational approaches like the mean field approximation and the Bethe approximation can be used in Boltzmann machine learning only in the positive phase. These variational approximations and, generally, any variational approach cannot be used in the negative phase because the minus sign in the Boltzmann machine learning rules would cause variational learning to change the parameters so as to maximize the divergence between the approximating and true distributions instead of minimizing it [39]. Therefore, the data– independent expectations should still be estimated by using a sampling algorithm like Algorithm 5.2.


5.5.2 Learning by naïve mean field approximation

In the naïve mean field approximation, we try to find a factorized distribution 퐐(ℎ|푣) that best describes the true posterior distribution 퐏퐓(ℎ|푣). The true posterior distribution 퐏퐓(ℎ|푣) is replaced by an approximate posterior 퐐(ℎ|푣) and the parameters of the network are updated to follow the gradient of the KL–divergence between 퐐(ℎ|푣) and 퐏퐓(ℎ|푣).

The particular distribution we choose for 퐐(ℎ|푣; μ) is the most general factorized distribution for binary variables, which has the form:

\mathbf{Q_{MF}}(h \mid v; \mu) = \prod_{i \in \mathcal{H}} \mu_i^{h_i} \cdot (1 - \mu_i)^{1 - h_i}    (5.72)

where μ = {휇푖}푖∈ℋ are the variational parameters and the product is taken over the hidden units.

In order to form the KL–divergence between the fully factorized 퐐퐌퐅 distribution and the 퐏퐓 distribution given by the equation (5.61), we use the fact that, under the distribution 퐐퐌퐅, ℎ푖 and ℎ푗 are independent random variables with mean values 휇푖 and 휇푗, respectively. Thus, we obtain:

\mathrm{KL}(\mathbf{Q_{MF}}(h \mid v; \mu) \,\|\, \mathbf{P_T}(h \mid v)) = \sum_{i \in \mathcal{H}} \big[\mu_i \ln \mu_i + (1 - \mu_i) \ln(1 - \mu_i)\big] - \sum_{i \in \mathcal{H}} \mu_i \sum_{j \in \mathcal{H},\, j < i} \mu_j \cdot \hat{w}_{ji} + \sum_{i \in \mathcal{H}} \boldsymbol{\theta}_i \cdot \mu_i + \ln Z_{\mathcal{H}}    (5.73)

In order to derive the learning rule for the mean field learning algorithm, we employ the same approach we used for the generic learning algorithm, that is, we minimize the KL–divergence KL(퐐퐌퐅(ℎ|푣) || 퐏퐓(ℎ|푣)). Concretely, we derive the mean field fixed–point equations by taking the gradient of the KL–divergence given by the equation (5.73) with respect to 휇푖 for all 푖 ∈ ℋ.

We note that 푍ℋ does not depend on the variational parameters. Thus, we obtain:

\frac{\partial\, \mathrm{KL}(\mathbf{Q_{MF}}(h \mid v; \mu) \,\|\, \mathbf{P_T}(h \mid v))}{\partial \mu_i} = -\sum_{j \in \mathrm{ne}(i)} \mu_j \cdot \hat{w}_{ji} + \boldsymbol{\theta}_i + \ln \frac{\mu_i}{1 - \mu_i}    (5.74)

where ne(푖) denotes the Markov blanket of unit 푖.

If we equate (5.74) to zero then we obtain the “mean field fixed–point equations”:

\mu_i = \mathrm{sigm}\Big(\sum_{j \in \mathrm{ne}(i)} \mu_j \cdot \hat{w}_{ji} - \boldsymbol{\theta}_i\Big) \quad \text{for all } i \in \mathcal{H}    (5.75)


The equations (5.75) are solved iteratively for a fixed–point solution. Note that each variational parameter 휇푖 updates its value based on a sum across the variational parameters 휇푗 within its Markov blanket.
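A minimal sketch of this iteration is given below. It assumes a symmetric hidden–to–hidden weight matrix and effective thresholds already computed from the clamped visible vector (the names are hypothetical), and it uses damped parallel updates, one of the options discussed next.

import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_fixed_point(W_hh, theta_eff, steps=200, damping=0.5, tol=1e-6, rng=None):
    """Iterate the mean field equations (5.75): mu_i = sigm(sum_j mu_j w_ji - theta_i).

    W_hh      : symmetric hidden-to-hidden weight matrix (zero diagonal)
    theta_eff : effective thresholds of the hidden units for the clamped pattern
    damping   : mixes old and new values to suppress oscillations
    """
    rng = rng or np.random.default_rng(0)
    mu = rng.random(len(theta_eff))                 # random initialization in (0, 1)
    for _ in range(steps):
        new_mu = sigm(W_hh @ mu - theta_eff)        # parallel update of all mu_i
        mu, diff = damping * mu + (1 - damping) * new_mu, np.abs(new_mu - mu).max()
        if diff < tol:
            break
    return mu

The returned values 휇푖 are then combined as 휇푖 ∙ 휇푗 to approximate the data–dependent statistics, as in the update rules (5.77)–(5.78) below.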

In Section 3.4 we learned how to solve the naïve mean field approximation problem by using the type of optimization “maximize the energy functional”. In this section we have learned how to solve the same problem by using the type of optimization “minimize the KL–divergence”. As we mentioned in Section 3.2, for a given problem, these approaches are equivalent. Therefore, the equations (3.67) and (5.75) should lead to essentially the same solutions (up to small numerical differences). Moreover, the convergence of one set of equations implies the convergence of the other set as well. Theorem 3.7 guarantees the convergence of the mean field fixed–point equations (3.67). Consequently, the mean field fixed–point equations (5.75) are also convergent.

When the mean field fixed–point equations (5.75) are run sequentially, i.e., we fix 휇−푖 and we minimize over 휇푖, the KL–divergence is convex in 휇푖 and the corresponding equation (5.75) finds the minimum in one step. Thus, this procedure can be interpreted as coordinate descent in {휇푖}푖∈ℋ and each step is guaranteed to decrease the KL–divergence. One drawback of this procedure is that it could suffer from slow convergence or entrapment in local minima.

Alternatively, all the {휇푖}i∈ℋ can be updated in parallel, which does not have the guarantee of decreasing the cost–function at any iteration, but may converge faster. In practice, one often observes oscillatory behavior which can be counteracted by damping the updates.

Finally, one can use any gradient based optimization technique to minimize over all the nodes {휇푖}푖∈ℋ simultaneously, making sure all {휇푖}푖∈ℋ remain between 0 and 1 [4].

Peterson and Anderson compared the mean field approximation to Gibbs sampling on a set of test cases and found that it ran 10–30 times faster, while yielding a roughly equivalent level of accuracy [16,27].

There are cases, however, in which the mean field approximation is known to be less accurate. For large, densely connected, weakly interacting systems the cumulative effect of all nodes behaves as a “rigid” (mean) field, which acts as an additional bias term, resulting in a factorized distribution. Also, the factorized mean field distribution is clearly unimodal, and could therefore never represent multimodal posterior distributions accurately. In particular, the KL–divergence

KL(퐐퐌퐅||퐏퐓) penalizes states with small posterior probability but non–vanishing probability under the mean field distribution much harder than the other way around. The result of this asymmetry in the KL–divergence is that the mean field distribution will choose to represent only one mode, ignoring the other ones. A typical situation where we expect multiple modes in the posterior is when there is not a lot of evidence clamped on the observation nodes [4]. Consider for instance the situation when the thresholds are given by:

\theta_i = -\frac{1}{2} \sum_{j \in \mathcal{V},\, j \neq i} \hat{w}_{ij}    (5.76)

in which case there is symmetry in the system: switching all the nodes (ℎ푖 → 1 − ℎ푖) leaves all the probabilities invariant. This implies that there are at least two modes. In general, we expect many more modes, and the mean field distribution can only capture one. Moreover, when the interactions are strong, we expect these modes to be concentrated on one state, with little fluctuation around them. The marginals predicted by mean field would therefore be close to either 1 or 0 (they are polarized), while the true marginal posterior probabilities are 1/2 due to the symmetry [4]. One way to overcome some of the difficulties mentioned above is to use more structured variational distributions 퐐 and minimize again the KL–divergence [4].

We end this section with a high–level pseudocode of the mean field learning algorithm.

During each clamping (positive) phase an algorithm similar to Algorithm 3.1, which performs minimization instead of maximization, is executed to solve the fixed–point equations (5.75), and the solution obtained for the variational parameters {휇푖}푖∈ℋ is used to approximate the data–dependent statistics. During each free–running (negative) phase an algorithm similar to Algorithm 5.2 is executed and the data–independent statistics 푝푖푗 and 푝푖 are estimated. Then the parameters W̅ of the network are updated according to the following rules:

Δ푤̂푖푗 = −훿 ∙ (휇푖 ∙ 휇푗 − 푝푖푗) (5.77)

Δ휃푖 = 훿 ∙ (휇푖 − 푝푖) (5.78) where 훿 is the learning rate.

Algorithm 5.3: Mean Field Boltzmann Machine Learning

Given: n x n weight matrix Ŵ ; n x 1 threshold vector Θ

a training set of 푝푎 data vectors: {푣^(푘)}_{1≤푘≤푝푎}

the number of learning cycles: 푒푝

the number of mean field steps: 푚푓

the number of Markov chains: 푀

begin

Step 1: initialize W(0) and 푀 fantasy particles: {휎(1)(0), … , 휎(푀)(0)}

For an arbitrary number of learning cycles:

Step 2: for e=1 to 푒푝 do

For each one of the patterns to be learned:

Step 3: for k =1 to 푝푎 do

Clamping phase:

Step 4: present and clamp the pattern 푣(푘)

START ALGORITHM 3.1

Step 5: randomly initialize μ = {휇푖}푖∈ℋ and run 푚푓 updates until convergence:

            휇푖 = sigm(∑_{푗∈ne(푖)} 휇푗 ∙ 푤̂푗푖 − 훉퐢) for all 푖 ∈ ℋ

END ALGORITHM 3.1

Step 6: set μ^(푘) = {휇푖^(푘)}푖∈ℋ

Free–running phase:

Step 7: present the pattern 푣(푘) but do not clamp it

START ALGORITHM 5.2

[…]

Step 8: collect statistics 푝푖푗

END ALGORITHM 5.2

Update the weights and thresholds for any pair of connected units 푖 ≠ 푗 such that at least one unit has been updated:

Step 9: Δ푤̂푖푗 = −훿 ∙ (휇푖 ∙ 휇푗 − 푝푖푗) for 푖 ≠ 푗


푤̂푖푗 ← 푤̂푖푗 + Δ푤̂푖푗

Δ휃푖 = 훿 ∙ (휇푖 − 푝푖)

휃푖 ← 휃푖 + Δ휃푖

end for //k

end for //e

end

return W
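The sketch below strings the pieces of Algorithm 5.3 together for one pass over the training patterns. It is only an illustration: the effective thresholds are assumed to have the form 훉퐢 = 휃푖 − ∑_{푗∈풱} 푤̂푗푖 푣푗 (consistent with equation (5.45)), the negative statistics come from persistent fantasy particles as in Algorithm 5.2, and all function and variable names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

def mean_field_bm_epoch(W, theta, patterns, particles, visible, hidden,
                        delta=0.05, mf_steps=50):
    """One pass of mean field Boltzmann machine learning (illustrative only).

    W, theta  : symmetric n x n float weights (zero diagonal) and thresholds over all units
    patterns  : list of 0/1 numpy vectors over the visible units
    particles : list of persistent 0/1 numpy configurations (negative phase)
    visible, hidden : numpy index arrays for the two groups of units
    """
    for v in patterns:
        # positive (clamping) phase: mean field over the hidden units, eq. (5.75)
        theta_eff = theta[hidden] - W[np.ix_(hidden, visible)] @ v   # assumed effective thresholds
        mu = rng.random(len(hidden))
        for _ in range(mf_steps):
            mu = sigm(W[np.ix_(hidden, hidden)] @ mu - theta_eff)
        q = np.zeros(len(theta)); q[visible] = v; q[hidden] = mu     # unit means, clamped phase

        # negative (free-running) phase: one Gibbs sweep per persistent particle
        for s in particles:
            for i in range(len(s)):
                s[i] = 1 if rng.random() < sigm(W[i] @ s - theta[i]) else 0
        S = np.array(particles, dtype=float)
        p_pair, p_unit = (S.T @ S) / len(particles), S.mean(axis=0)

        # parameter update, equations (5.77)-(5.78)
        W += -delta * (np.outer(q, q) - p_pair); np.fill_diagonal(W, 0.0)
        theta += delta * (q - p_unit)
    return W, theta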

5.6 Unlearning and relearning in Boltzmann Machines

The concept of “unlearning” in a connectionist network is closely related to the concept of “reverse learning” in neuroscience. Crick and Mitchison proposed a model of reverse learning that compares dream sleep, or the REM phase of sleep, to an offline computer. According to the model, we dream in order to forget, and this involves a process of “reverse learning” or “unlearning” [76].

A simulation of reverse learning was performed by Hopfield, Feinstein, and Palmer [77] who independently had been studying ways to improve the associative storage capacity of simple networks of binary processors. In their algorithm an input is presented to the network as an initial condition and the system evolves by falling into a nearby local energy minimum. However, not all local energy minima represent stored information. In creating the desired minima, they accidentally create other spurious minima, and to eliminate these they use "unlearning": The learning procedure is applied with reverse sign to the states found after starting from random initial conditions. Following this procedure, the performance of the system in accessing stored states was found to be improved [43].

The reverse learning proposed by Crick and Mitchison and the reverse learning algorithm proposed by Hopfield et al. both have an interesting relationship with the learning algorithm proposed by Hinton. The two phases of Hinton’s learning algorithm resemble the learning and unlearning procedures. In the positive phase, Hebbian learning with a positive coefficient occurs, during which information in the environment is captured by the weights. During the negative phase the system randomly samples states according to their Boltzmann distribution and Hebbian learning occurs with a negative coefficient. However, these two phases need not be implemented in the manner suggested by Crick and Mitchison. For instance, during the negative phase the average co–occurrences could be computed without making any changes to the weights. These averages could then be used as a baseline for making changes during the positive phase; that is, the co–occurrences during the positive phase could be computed and the baseline subtracted before each permanent weight change. Hence, an alternative but equivalent proposal for the function of dream sleep is to recalibrate the baseline for plasticity – the break–even point which determines whether a synaptic weight is incremented or decremented. This would be safer than making permanent weight decrements to synaptic weights during sleep and solves the problem of deciding how much “unlearning” to do [43].

Hinton’s learning algorithm refines Crick’s and Mitchison’s interpretation of why two phases are needed. He considered a hidden unit deep within the network and wanted to know how its connections with other units should be changed to best capture regularity present in the environment. He started by observing that, if the unit does not receive direct input from the environment, the hidden unit has no way to determine whether the information it receives from neighboring units is ultimately caused by structure in the environment or is entirely a result of the other weights. Hinton compared this scenario with a “folie à deux” where two parts of the network each construct a model of the other and ignore the external environment [43]. He realized that the contributions of internal and external sources can be separated by comparing the co–occurrences in the positive phase with similar information that is collected in the absence of environmental input, and in this way the negative phase acts as a control condition. Moreover, because of the special properties of equilibrium, it is possible to subtract off this purely internal contribution and use the difference to update the weights. His conclusion was that the role of the two phases is to make the system maximally responsive to regularities present in the environment and to prevent the system from using its capacity to model internally generated regularities [43].

A network like the Boltzmann machine can experience some form of damage. Hinton studied the behavior of the network, specifically the distributed representations constructed by the learning rule, under such circumstances. He observed that the network uses distributed representations among the intermediate units when it learns the associations. His interpretation of this fact was that, because many of the weights are involved in encoding several different associations and each association is encoded in many weights, if a weight is changed because of some form of damage, it will affect several different energy minima and all of them will require the same change in the weight to restore them to their previous depths. So, in relearning any of the associations, there should be a positive transfer effect which tends to restore the others. Hinton observed that this effect is actually rather weak and easily masked, so it can only be seen clearly if the network is retrained on most of the original associations. His conclusion was that the associations constructed by the learning rule are resistant to minor damage and exhibit rapid relearning after major damage. Moreover, the relearning process can bring back associations that are not practiced during the relearning and are only randomly related to the associations that are practiced [43].


Chapter 6. Conclusions

6.1 Summary of what has been done

This thesis addresses several aspects of the theory of Boltzmann machines. Our principal goal has been to provide, from a rigorous mathematical perspective, a unified framework for two well–known classes of learning algorithms in asynchronous Boltzmann machines: those based on Monte Carlo methods and those based on variational approximations of the free energy.

The second chapter focused on the foundation of knowledge necessary to understand the subsequent chapters. We chose to introduce the Boltzmann–Gibbs distribution from both a physicist’s and a computer scientist’s perspective to allow the concept of energy, also originating in physics, to settle on solid ground. We introduced the pairwise Markov random fields and explained their relationship with the Boltzmann–Gibbs distribution. We also introduced the Gibbs free energy as a convenient replacement for the Boltzmann–Gibbs distribution when the goal is to perform approximate inference in a Markov random field. Then, we proceeded to introduce the ancestors of the Boltzmann machine: the connectionist networks and the Hopfield networks. While we gave only a high–level overview of the connectionist networks, we gave a quite detailed presentation of the Hopfield network. We justified the attention granted to the Hopfield network by the fact that it is not just an ancestor of the Boltzmann machine, but is a Boltzmann machine itself, as we explained in Chapter 5.

The third chapter built the infrastructure of knowledge necessary to understand the subsequent chapters. The topic of interest in this chapter was energy, and the motivation behind it is the relationship between the equilibrium distribution and the free energy in a Markov random field. Estimating the distribution of a Markov random field is an expensive process and there is no foolproof method to determine whether equilibrium has been reached. Some of the difficulties encountered when operating with distributions no longer exist when operating with energies. Furthermore, we introduced a number of Gibbs free energies: the mean field free energy and the Bethe–Gibbs free energy, which are variational free energies, and the Bethe free energy. These energies have been purposely defined and analyzed as potential candidates for the true free energy of a Markov random field. Then we presented an approximate inference algorithm – belief optimization – based on the Bethe–Gibbs approximation of the free energy, which could potentially be used in Boltzmann machine learning.


The goal of the fourth chapter was to present in detail every important aspect of the Boltzmann machine except learning. From various sources we synthesized a rigorous definition of the Boltzmann machine model together with all the associated concepts. We introduced the concept of true energy and explained its intrinsic relationship with various forms of the true probability distribution. We described the algorithmic aspects of the dynamics of the asynchronous Boltzmann machine. In Chapter 5 we returned to this topic and presented it from a different perspective. To allow the reader to intuitively understand the “inner life” of the Boltzmann machine, we presented the biological interpretation of the model as was given by its creator Geoffrey Hinton.

The fifth chapter is the core of this thesis. It was dedicated entirely to learning algorithms in an asynchronous Boltzmann machine. We formally defined the learning process, following the same rigorous approach as in Chapter 4. We justified, from different angles, the necessity of two phases in a learning algorithm. Following Hinton’s terminology, we called them the positive and negative phases, respectively. Currently there are two equivalent ways to approach learning: maximizing the likelihood of the parameters or minimizing the KL–divergence of Gibbs measures. We chose to use the KL–divergence approach for all the learning algorithms we presented. Then we introduced the class of learning algorithms based on approximate maximum likelihood. This class contains the generic learning algorithm due to Hinton and Sejnowski. We provided a very detailed analysis of the generic learning algorithm, including the missing piece from the original algorithm, which was identified by Jones. The class of algorithms based on approximate maximum likelihood was completed with the introduction of three sampling algorithms used to collect the statistics during both the positive and negative phases: Gibbs sampling, stochastic approximation using persistent Markov chains, and contrastive divergence. We summarized the main steps of the generic learning algorithm in a high–level pseudocode and discussed the factors that influence its complexity. The collection of statistics for the generic learning algorithm is conditioned on thermal equilibrium. To understand the dynamics of the Boltzmann machine from a probabilistic point of view, we provided a deep analysis of the equilibrium distribution as a function of the pseudo–temperature. Furthermore, we introduced the class of learning algorithms based on the variational approaches discussed in Chapter 3, and we explained the connection between the approximations of the free energies and the learning process. We provided a detailed analysis of mean field learning and the connections it has with algorithms introduced previously: the mean field approximation and stochastic approximation. We summarized the main steps of the mean field learning algorithm in a high–level pseudocode and discussed the factors that influence its complexity. Finally, we introduced the concepts of


unlearning and relearning and gave the intuition behind them as was explained by their creator Geoffrey Hinton.

6.2 Future directions

There are a few open questions or directions to explore inspired by ideas presented in this thesis:

1. an algorithm to detect when an asynchronous symmetric Boltzmann machine reached its equilibrium distribution;

2. an explicit formula for the equilibrium distribution and a learning algorithm for a Boltzmann machine with asymmetric weights;

3. Is it possible to extend the learning algorithm to higher–order Markov processes?

4. an improvement to the Boltzmann machine model itself that would lead to better and faster learning algorithms;

5. a breakthrough in Boltzmann machine learning?

Solutions to some of these questions would represent a considerable improvement on the current state of knowledge.

One idea to improve the model is to find link(s) between the energy of a thermodynamic system with respect to pressure and volume (equation (2.17)) and some aspect(s) of the cognitive features and/or processes of the human brain that, importantly, can be represented in an artificial neural network like the Boltzmann machine. If these connections existed and had been reflected in the Boltzmann machine model, they could be “consumed” directly by new learning algorithms or indirectly by new optimization algorithms that perform approximate inference in the underlying Markov random field.


References

1. Sussmann, H. J. (1988, December). Learning algorithms for Boltzmann machines. In Decision and Control, 1988., Proceedings of the 27th IEEE Conference on (pp. 786-791). IEEE. 2. Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York, 4. 3. Sussmann, H. J. The mathematical theory of learning algorithms for Boltzmann machines. In Neural Networks, 1989. IJCNN., International Joint Conference on (pp. 431-437). IEEE. 4. Welling, M., & Teh, Y. W. (2003). Approximate inference in Boltzmann machines. , 143(1), 19-50. 5. Salakhutdinov, R. (2008). Learning and evaluating Boltzmann machines (p. 31). Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto. 6. Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554. 7. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8), 2554-2558. 8. Hinton, G. E., & Sejnowski, T. J. (1983, June). Optimal perceptual inference. In Proceedings of the IEEE conference on and Pattern Recognition (pp. 448-453). IEEE New York. 9. Fahlman, S. E., Hinton, G. E., & Sejnowski, T. J. (1983). Massively parallel architectures for Al: NETL, THISTLE, and BOLTZMANN machines. Proceedings of AAAI-83109, 113. 10. Hinton, G. E., Sejnowski, T. J., & Ackley, D. H. (1984). Boltzmann machines: Constraint satisfaction networks that learn. Pittsburgh, PA: Carnegie-Mellon University, Department of Computer Science. 11. Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive science, 9(1), 147-169. 12. Kirkpatrick, S. (1984). Optimization by simulated annealing: Quantitative studies. Journal of statistical physics, 34(5-6), 975-986. 13. Salakhutdinov, R., & Hinton, G. (2012). An efficient learning procedure for deep Boltzmann machines. Neural computation, 24(8), 1967-2006. 14. Neal, R. M. (1992). Connectionist learning of belief networks. Artificial intelligence, 56(1), 71-113. 15. Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory (No. CU-CS-321-86). COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE. 16. Peterson, C. (1987). A mean field theory learning algorithm for neural networks. Complex systems, 1, 995-1019. 17. Peterson, C., & Hartman, E. (1989). Explorations of the mean field theory learning algorithm. Neural Networks, 2(6), 475-494. 18. Galland, C. C., & Hinton, G. E. (1990). Discovering high order features with mean field modules. In Advances in neural information processing systems (pp. 509-515). 19. Galland, C. (1992). Learning in deterministic Boltzmann machine networks.


20. Kappen, H. J., & Rodríguez, F. B. (1998). Boltzmann machine learning using mean field theory and linear response correction. Advances in Neural Information Processing Systems, 280-286.
21. Kappen, H. J., & Rodríguez, F. B. (1997). Mean field approach to learning in Boltzmann machines. Pattern Recognition Letters, 18(11), 1317-1322.
22. Tanaka, T. (1998). Mean-field theory of Boltzmann machine learning. Physical Review E, 58(2), 2302.
23. Tanaka, T. (1999). A theory of mean field approximation. Advances in Neural Information Processing Systems, 351-360.
24. Zemel, R. S. (1993). A minimum description length framework for unsupervised learning (Doctoral dissertation, University of Toronto).
25. Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. Advances in Neural Information Processing Systems, 3-3.
26. Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models (pp. 355-368). Springer Netherlands.
27. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183-233.
28. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003). Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, 8, 236-239.
29. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2005). Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7), 2282-2312.
30. Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2005). A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7), 2313-2335.
31. Wainwright, M. J., & Jordan, M. I. (2006). Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6), 2099-2109.
32. Globerson, A., & Jaakkola, T. S. (2007). Approximate inference using conditional entropy decompositions. In AISTATS (pp. 130-138).
33. Kabashima, Y., & Saad, D. (1998). Belief propagation vs. TAP for decoding corrupted messages. EPL (Europhysics Letters), 44(5), 668.
34. Opper, M., & Winther, O. (1996). Mean field approach to Bayes learning in feed-forward neural networks. Physical Review Letters, 76(11), 1964.
35. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000, December). Generalized belief propagation. In NIPS (Vol. 13, pp. 689-695).
36. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001). Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Advances in Neural Information Processing Systems, 13.
37. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771-1800.


38. Salakhutdinov, R., & Hinton, G. E. (2007, March). Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS (pp. 412-419).
39. Salakhutdinov, R., & Hinton, G. E. (2009, April). Deep Boltzmann machines. In AISTATS (Vol. 1, p. 3).
40. Little, W. A. (1974). The existence of persistent states in the brain. In From High-Temperature Superconductivity to Microminiature Refrigeration (pp. 145-164). Springer US.
41. Little, W. A., & Shaw, G. L. (1978). Analytic study of the memory storage capacity of a neural network. Mathematical Biosciences, 39(3-4), 281-290.
42. Viveros, U. X. I. (2001). The Synchronous Boltzmann Machine (Doctoral dissertation, University of London).
43. Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, 282-317.
44. Boltzmann, L. (2012). Theoretical physics and philosophical problems: Selected writings (Vol. 5). Springer Science & Business Media.
45. Gibbs, J. W. (1928). The collected works of J. Willard Gibbs (Vol. 1). H. A. Bumstead & W. R. Longley (Eds.). Longmans, Green and Company.
46. Dobrushin, R. L. (1968). Description of a random field by means of conditional probabilities, with applications. Teor. Veroyatnost. i Primenen, 13.
47. Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2), 1-305.
48. Spitzer, F. (1971). Markov random fields and Gibbs ensembles. The American Mathematical Monthly, 78(2), 142-154.
49. Yedidia, J. (2001). An idiosyncratic journey beyond mean field theory. Advanced Mean Field Methods: Theory and Practice, 21-36.
50. Gibbs, J. W. (1873). A method of geometrical representation of the thermodynamic properties of substances by means of surfaces. Connecticut Academy of Arts and Sciences.
51. Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40(1), 185-234.
52. Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10), 3088-3092.
53. Löwel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255(5041), 209.
54. Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT Press.
55. Georges, A., & Yedidia, J. S. (1991). How to expand around mean-field theory using high-temperature expansions. Journal of Physics A: Mathematical and General, 24(9), 2173.
56. Plefka, T. (1982). Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. Journal of Physics A: Mathematical and General, 15(6), 1971.


57. Plefka, T. (2006). Expansion of the Gibbs potential for quantum many-body systems: General formalism with applications to the spin glass and the weakly nonideal Bose gas. Physical Review E, 73(1), 016129.
58. Shin, J. (2012). Complexity of Bethe approximation. In AISTATS (pp. 1037-1045).
59. Weiss, Y., & Freeman, W. T. (2001). On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2), 736-744.
60. Bishop, C. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer; corrected 2nd printing, 2007.
61. Heskes, T. (2002). Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In Advances in Neural Information Processing Systems (pp. 343-350).
62. Heskes, T. (2004). On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16(11), 2379-2413.
63. Jones, A. (1996). A lacuna in the theory of asynchronous Boltzmann machine learning. Simpósio Brasileiro de Redes Neurais, 19-27.
64. Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1(1), 143-150.
65. Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6), 721-741.
66. Tieleman, T. (2008, July). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning (pp. 1064-1071). ACM.
67. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400-407.
68. Younes, L. (1989). Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82(4), 625-645.
69. Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes, 65(3-4), 177-228.
70. Yuille, A. L. (2006). The convergence of contrastive divergences. Department of Statistics, UCLA.
71. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1-127.
72. Carreira-Perpiñán, M. A., & Hinton, G. (2005, January). On contrastive divergence learning. In AISTATS (Vol. 10, pp. 33-40).
73. Gantmacher, F. R., & Brenner, J. L. (2005). Applications of the Theory of Matrices. Courier Corporation.
74. Gantmacher, F. R. (1959). Matrix Theory, Vols. 1 and 2. New York.
75. Aarts, E., & Korst, J. (1988). Simulated annealing and Boltzmann machines.


76. Crick, F., & Mitchison, G. (1983). The function of dream sleep. Nature, 304(5922), 111-114.
77. Hopfield, J. J., Feinstein, D. I., & Palmer, R. G. (1983). ‘Unlearning’ has a stabilizing effect in collective memories.
78. Levin, D. A., Peres, Y., & Wilmer, E. L. (2009). Markov Chains and Mixing Times. American Mathematical Society.


Appendix A: Mathematical notations

In this appendix we present the main conventions and notations used throughout this paper.

• We use |A| to denote the cardinality of a finite set A.
• We denote matrices by uppercase bold roman letters, such as 퐀.
• We use a superscript T to denote the transpose of a matrix or vector.
• Without restricting the generality, we assume that a set of processing units (neurons) 𝒩 is indexed by the set of natural numbers {1, 2, …, 푛} for 푛 = |𝒩| ∈ ℕ. We also make the convention that "0" denotes some object not belonging to 𝒩.
• A random variable is generally denoted 푋 (typeface italic uppercase x).
• A univariate (scalar) random variable is denoted as a general random variable.
• An individual observation of a scalar random variable is denoted by 퓍 (script italic lowercase x). A set comprising 푚 observations of a scalar random variable 푋 is denoted by 𝕩 (double–struck lowercase x) and is written as:

𝕩 ≡ (푥(1), 푥(2), … , 푥(푚)) (A1)

 A multivariate random variable is denoted by X (typeface uppercase x):

X ≡ (푋1, 푋2, … , 푋푛)^T (A2)

 We use the notation X−i to designate all the random variables from X except 푋푖, i.e.:

X−i = (푋1, … , 푋푖−1, 푋푖+1, … , 푋푛) (A3)

• We use the symbol ⊥ to represent the conditional independence of random variables. Example: 퐴 ⊥ 퐵 | 퐶.
• We use bold uppercase letters to designate probability distributions. Examples: 퐏, 퐐.
• We use the accent "bar" to denote a candidate for an unknown probability distribution. Example: the probability distribution 퐏̅ is an approximation of the probability distribution 퐏.
• We use the accent "tilde" to denote the unnormalized measure of a probability distribution. Example: 퐏̃ is the unnormalized measure of the probability distribution 퐏.
• We use the accent "hat" to identify the collection of canonical parameters corresponding to bi–dimensional cliques (edges) in a pairwise Markov network. The collection of all the canonical parameters is represented with the same letter but without the "hat". Example: the first collection is 퐖̂; the second collection is 퐖.


Appendix B: Probability theory and statistics

The definitions and theoretical results presented in this appendix are taken from the books [54] and [78]. They are notions from probability theory and statistics that have been used in the previous sections.

In probability and statistics, a Random (Stochastic) Variable, usually written 푋, is a variable whose value is subject to variation due to chance (i.e., randomness, in a mathematical sense); in other words, its values are numerical outcomes of a random phenomenon or experiment. As opposed to other mathematical variables, a random variable conceptually does not have a single, fixed value (even if unknown); rather, it can take on a set of possible different values, each with an associated probability. Based on the number of components of the random variable associated with a statistical unit, random variables are classified into two categories: univariate and multivariate.

A Univariate Random Variable or Random Scalar is a single variable whose value is unknown, either because the value has not yet occurred, or because there is imperfect knowledge of its value. Normally a random scalar is a real number.

A Multivariate Random Variable or Random Vector, usually written X, is a list of variables, the value of each of which is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because there may be correlations among them; often they represent different properties of an individual statistical unit (e.g. a particular person, event, etc.). Normally each element of a random vector is a real number.

In mathematics, a moment is a specific quantitative measure of the shape of a set of points. The moments of a random variable (or of its distribution) are expected values of powers or related functions of the random variable. The first moment, also called mean, is a measure of the center or location of a random variable or distribution. The second moment of a random variable is also called variance and its square root is called standard deviation. The variance and standard deviation are measures of the scale or spread of a random variable or distribution.
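
As a small numerical illustration of these notions (not from the original text; numpy is assumed and the sample values are invented), the first two moments of a sample can be estimated as follows:

import numpy as np

# Invented sample of eight observations of a scalar random variable X
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()               # first moment: center / location
variance = x.var()            # second central moment: spread
std_dev = np.sqrt(variance)   # standard deviation

print(mean, variance, std_dev)   # 5.0 4.0 2.0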

• 𝝈–algebra: Given a set Ω, a 휎–algebra is a collection ℱ of subsets of Ω satisfying the following conditions:
  • Ω ∈ ℱ;
  • if 퐴1, 퐴2, … are elements of ℱ, then ⋃_{푖=1}^{∞} 퐴푖 ∈ ℱ;
  • if 퐴 ∈ ℱ, then 퐴^C = Ω − 퐴 ∈ ℱ.

 Probability space:

A probability space is a three–tuple (Ω, ℱ, 푝) in which the three components are:

• Sample space: A nonempty set Ω called the sample space, which represents all possible outcomes.
• Event space: A collection ℱ of subsets of Ω, called the event space. The elements of ℱ are called events.

If Ω is discrete, then ℱ is usually the collection of all subsets of Ω: ℱ = pow(Ω). If Ω is continuous, then ℱ is usually a 휎–algebra on Ω.

• Probability function: A function 푝 ∶ ℱ ⟶ ℝ that assigns probabilities to the events of ℱ and satisfies the requirements of a probability measure over Ω as specified below.

An outcome is the result of a single execution of the model. Once the probability space is established, it is assumed that “nature” makes its move and selects a single outcome ω from the sample space Ω. All the events in ℱ that contain the selected outcome ω (recall that each event is a subset of Ω) are said to “have occurred”. The selection performed by “nature” is done in such a way that, if the experiment were to be repeated an infinite number of times, the relative frequencies of occurrence of each of the events would coincide with the probabilities prescribed by the function 푝.

 Borel 𝝈–algebra:

If the sample space Ω is a countable set, the 휎–algebra of events is usually taken to be pow(Ω). If Ω is ℝ푑, then the Borel 휎–algebra is the smallest 휎–algebra containing all open sets.

 Probability measure:

Given a probability space, a probability measure is a non–negative function 퐏 defined on events and satisfying the following:

 퐏(Ω) = 1;

 for any sequence of events 퐵1, 퐵2, … which are disjoint, meaning 퐵푖 ∩ 퐵푗 = ∅ for 푖 ≠ 푗:

$\mathbf{P}\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} \mathbf{P}(B_i)$ (B1)


 Probability distribution:

If Ω is a countable set, a probability distribution (or sometimes simply a probability) on Ω is a function 푝 ∶ Ω ⟶ [0, 1] such that:

$\sum_{\omega \in \Omega} p(\omega) = 1$ (B2)

We will abuse notation and write, for any subset 퐴 ⊂ Ω, 푝(퐴) = ∑ω∈A 푝(ω). The set function 퐴 ⟶ 푝(퐴) is a probability measure.

 Measurable function:

A function 푓: Ω ⟶ ℝ is called measurable if 푓−1(퐵) is an event for all open sets 퐵.

 Density function:

If Ω = 퐷 is an open subset of ℝ푑 and 푓 ∶ 퐷 ⟶ [0, ∞) is a measurable function satisfying

$\int_{D} f(x)\, dx = 1$, then 푓 is called a density function.

Given a density function, the following set function defined for Borel sets 퐵 is a probability measure:

$\mu_f(B) = \int_{B} f(x)\, dx$ (B3)

 Random variable:

Given a probability space, a random variable 푋 is a measurable function defined on Ω. We write {푋 ∈ 퐴} as shorthand for the set:

푋−1(퐴) = {ω ∈ Ω ∶ 푋(ω) ∈ 퐴} (B4)

 Distribution of a random variable:

The distribution of a random variable 푋 is the probability measure 휇푋 on ℝ defined for Borel set 퐵 by:

휇푋(퐵) = 퐏({푋 ∈ 퐵}) = 퐏{푋 ∈ 퐵} (B5)

 Types of random variables:

We call a random variable 푋 discrete if there is a finite or countable set 푆, called the support of 푋, such that 휇푋(푆) = 1. In this case, the following function is a probability distribution on 푆:


푝푋(푎) = 퐏{푋 = 푎} (B6)

We call a random variable 푋 absolutely continuous if there is a density function 푓 on ℝ such that:

$\mu_X(A) = \int_{A} f(x)\, dx$ (B7)

 Mean or expectation:

For a discrete random variable 푋, the mean or expectation 퐄(푋) can be computed by the following formula whose sum has at most countably many non–zero terms:

$\mathbf{E}_{\mathbf{P}}(X) = \mathbf{E}_{\mathbf{P}}[X] = \mathbf{E}[X] = \sum_{x \in \mathbb{R}} x \cdot \mathbf{P}(X = x)$ (B8)

For an absolutely continuous random variable 푋, the expectation 퐄(푋) is computed by the formula:

$\mathbf{E}_{f}(X) = \mathbf{E}_{f}[X] = \mathbf{E}[X] = \int_{\mathbb{R}} x \cdot f_X(x)\, dx$ (B9)

 Variance:

The variance of a random variable 푋 is defined by:

$\mathbf{Var}(X) = \mathbf{Var}[X] = \mathbf{E}\left[(X - \mathbf{E}[X])^2\right]$ (B10)

If 푋 is a random variable, 푔 ∶ ℝ ⟶ ℝ is a function, and 푌 = 푔(푋) is a function of 푋, then the expectation 퐄[푌] can be computed via the formulae:

$\mathbf{E}[Y] = \begin{cases} \int g(x) \cdot f(x)\, dx, & \text{if } X \text{ is continuous with density } f \\ \sum_{x \in S} g(x) \cdot p_X(x), & \text{if } X \text{ is discrete with support } S \end{cases}$ (B11)

 Standard deviation:

The standard deviation of a random variable 푋 is defined as the (nonnegative) square root of its variance:

훔푋 = √퐕퐚퐫[푋] (B12)

 Covariance:


The covariance between two jointly distributed real–valued random variables 푋 and 푌 with finite variances is defined as:

퐜퐨퐯(푋, 푌) = 퐄[(푋 − 퐄[푋]) ∙ (푌 − 퐄[푌])] = 퐄[푋 ∙ 푌] − 퐄[푋] ∙ 퐄[푌] (B13)

 Correlation coefficient:

The population correlation coefficient between two random variables 푋 and 푌 with expected values 퐄[푋] and 퐄[푌] and standard deviations 훔푋 and 훔푌 is defined as:

$\rho_{X,Y} = \mathbf{corr}(X, Y) = \frac{\mathbf{cov}(X, Y)}{\boldsymbol{\sigma}_X \cdot \boldsymbol{\sigma}_Y}$ (B14)

The sample correlation coefficient between two data sets 퐗 = {x1, … , x푛} and 퐘 = {푦1, … , y푛}, each of them containing 푛 values, is defined as:

$r_{X,Y} = \mathbf{corr}(\mathbf{X}, \mathbf{Y}) = \frac{\sum_{i} x_i \cdot y_i - n \cdot \bar{x} \cdot \bar{y}}{(n - 1) \cdot s_{\mathbf{X}} \cdot s_{\mathbf{Y}}}$ (B15)
where $s_{\mathbf{X}}$ and $s_{\mathbf{Y}}$ represent the sample standard deviations of 퐗 and 퐘 respectively, and $\bar{x}$ and $\bar{y}$ represent the sample means of 퐗 and 퐘 respectively.
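
A hedged numerical sketch of formula (B15), assuming numpy; the two data sets below are invented for illustration, and numpy's built-in estimator np.corrcoef is used only as a cross-check.

import numpy as np

# Invented paired data sets X and Y with n = 5 observations each
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Sample correlation coefficient computed directly from formula (B15)
r = (np.sum(x * y) - n * x.mean() * y.mean()) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

print(r)                          # approximately 0.7746
print(np.corrcoef(x, y)[0, 1])    # same value from numpy's estimator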

 Independence:

Fix a probability space and a probability measure 퐏. Two events 퐴 and 퐵 are independent if:

퐏(퐴 ∩ 퐵) = 퐏(퐴) ∙ 퐏(퐵) (B16)

Events 퐴1, 퐴2, … are independent if for any 푖1, 푖2, … 푖푟:

퐏(퐴푖1 ∩ 퐴푖2 ∩ … 퐴푖푟 ) = 퐏(퐴푖1) ∙ 퐏(퐴푖2) ∙ … ∙ 퐏(퐴푖푟 ) (B17)

Random variables 푋1, 푋2, … are independent if for all Borel sets 퐵1, 퐵2, … the events {푋1 ∈ 퐵1}, {푋2 ∈ 퐵2}, … are independent.

Proposition B.1: If 푋 and 푌 are independent random variables such that 퐕퐚퐫(푋) and 퐕퐚퐫(푌) exist, then:

퐕퐚퐫[푋 + 푌] = 퐕퐚퐫[푋] + 퐕퐚퐫[푌] (B18)

 Theorem B.2 (Markov’s inequality):

For a non–negative random variable 푋 and any 푎 > 0:

$\mathbf{P}\{X > a\} \le \frac{\mathbf{E}(X)}{a}$ (B19)


 Convergence in probability:

A sequence of random variables (푋푡) converges in probability to a random variable 푋 if:

$\lim_{t \to \infty} \mathbf{P}\{|X_t - X| > \varepsilon\} = 0$ for all $\varepsilon > 0$ (B20)

This is denoted by: $X_t \xrightarrow{\;pr\;} X$.

 Theorem B.3 (Convergence for sequence of random variables):

Let (푌푛) be a sequence of random variables and 푌 be a random variable such that:

$\mathbf{P}\left\{\lim_{n \to \infty} Y_n = Y\right\} = 1$ (B21)

Bounded Convergence:

If there is a constant 푘 ≥ 0 independent of 푛 such that |푌푛| < 푘 for all 푛 ∈ ℕ, then:

$\lim_{n \to \infty} \mathbf{E}[Y_n] = \mathbf{E}[Y]$ (B22)

Dominated Convergence:

If there is a random variable 푍 such that 퐄[|푍|] < ∞ and 퐏{|푌푛| ≤ |푍|} = 1 for all 푛 ∈ ℕ, then:

$\lim_{n \to \infty} \mathbf{E}[Y_n] = \mathbf{E}[Y]$ (B23)

Monotone Convergence:

If 퐏{푌푛 ≤ 푌푛+1} = 1 for all 푛 ∈ ℕ, then:

$\lim_{n \to \infty} \mathbf{E}[Y_n] = \mathbf{E}[Y]$ (B24)

 Entropy of a univariate random variable:

Let 퐏(푋) be a distribution over a univariate random variable 푋. The entropy of 푋 is defined:

$S_{\mathbf{P}}(X) = \mathbf{E}_{\mathbf{P}}\left[\log \frac{1}{\mathbf{P}(X)}\right] = \sum_{X} \mathbf{P}(X) \cdot \log \frac{1}{\mathbf{P}(X)} = -\sum_{X} \mathbf{P}(X) \cdot \log \mathbf{P}(X)$ (B25)

where we treat $0 \cdot \log \frac{1}{0} = 0$ because $\lim_{\varepsilon \to 0} \varepsilon \cdot \log \frac{1}{\varepsilon} = 0$.

The entropy can be viewed as a measure of our uncertainty about the value of 푋.

 Entropy of a multivariate random variable:

The previous definition extends naturally to multivariate random variables. Let 퐏(푋1, … , 푋푛) be a distribution over random variables 푋1, … , 푋푛. Then the joint entropy of 푋1, … , 푋푛 is:


$S_{\mathbf{P}}(X_1, \dots, X_n) = \mathbf{E}\left[\log \frac{1}{\mathbf{P}(X_1, \dots, X_n)}\right] = \sum_{X_1, \dots, X_n} \mathbf{P}(X_1, \dots, X_n) \cdot \log \frac{1}{\mathbf{P}(X_1, \dots, X_n)}$ (B26)

$S_{\mathbf{P}}(X_1, \dots, X_n) = -\sum_{X_1, \dots, X_n} \mathbf{P}(X_1, \dots, X_n) \cdot \log \mathbf{P}(X_1, \dots, X_n)$ (B27)

 Distance between distributions:

There are situations when we want to compare two distributions. For instance, we might want to approximate a distribution by one with desired qualities, e.g. one with a simpler representation or one that is more efficient to reason with, and we then want to evaluate the quality of a candidate approximation. Another example arises in the context of learning a distribution from data, where we want to compare the learned distribution to the "true" distribution from which the data was generated.

Therefore, we want to construct a distance measure 푑 that evaluates the distance between two distributions. There are some properties we might wish for in such a measure:

Positivity: 푑(퐏, 퐐) is always nonnegative and is zero if and only if 퐏 = 퐐.

Symmetry: 푑(퐏, 퐐) = 푑(퐐, 퐏).

Triangle inequality: for any three distributions 퐏, 퐐, 퐑 we have that:

푑(퐏, 퐑) ≤ 푑(퐏, 퐐) + 푑(퐐, 퐑) (B28)

A distance measure that satisfies these three properties is called a distance metric.

 Relative entropy and KL–divergence:

Let 퐐 and 퐏 be two distributions over random variables 푋1, … , 푋푛. The relative entropy of 퐐 and 퐏 is:

$\mathrm{KL}(\mathbf{Q}(X_1 \dots X_n) \,\|\, \mathbf{P}(X_1 \dots X_n)) = \mathbf{E}_{\mathbf{Q}}\left[\log \frac{\mathbf{Q}(X_1, \dots, X_n)}{\mathbf{P}(X_1, \dots, X_n)}\right]$ (B29)

$\mathrm{KL}(\mathbf{Q}(X_1 \dots X_n) \,\|\, \mathbf{P}(X_1 \dots X_n)) = \sum_{X_1, \dots, X_n} \mathbf{Q}(X_1 \dots X_n) \cdot \log \frac{\mathbf{Q}(X_1, \dots, X_n)}{\mathbf{P}(X_1, \dots, X_n)}$ (B30)

When the set of variables is clear from context we use the shorthand definition: KL(퐐||퐏). This measure is often known as the Kullback–Leibler divergence or KL–divergence.

The relative entropy measures the additional cost imposed by using a wrong distribution 퐐 instead of the true distribution 퐏. Thus, 퐐 is close to 퐏 in the sense of relative entropy if this cost is small. The additional cost of using the wrong distribution is always positive. Moreover, the relative entropy is zero if and only if the two distributions are identical.

Unfortunately, positivity is the only property of distances that relative entropy satisfies; it satisfies neither symmetry nor triangle inequality.
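
To make definitions (B25) and (B29)–(B30) concrete, the following Python sketch (an illustration only; the two discrete distributions are invented and numpy is assumed) computes the entropy of a distribution and the KL–divergence between two distributions, and shows numerically that the KL–divergence is not symmetric.

import numpy as np

def entropy(p):
    # Entropy (B25), using the convention 0 * log(1/0) = 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(q, p):
    # Relative entropy KL(Q || P), equations (B29)-(B30), assuming P > 0 wherever Q > 0
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    nz = q > 0
    return np.sum(q[nz] * np.log(q[nz] / p[nz]))

P = [0.50, 0.25, 0.25]   # "true" distribution
Q = [0.40, 0.40, 0.20]   # candidate approximation

print(entropy(P))            # about 1.040 nats
print(kl_divergence(Q, P))   # about 0.054
print(kl_divergence(P, Q))   # about 0.050 -> KL(Q||P) != KL(P||Q), so KL is not a distance metric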


Appendix C: Finite Markov chains

The notions presented in this appendix have been used in the previous sections and the majority of them are taken from the book [78].

A finite Markov chain is a process which moves among the elements of a finite set Ω in the following manner: if 푥 ∈ Ω is the current position of the process, the next position is chosen according to a fixed probability distribution 푃(푥,·). We formally define this type of process and present some of its properties.

 Finite Markov chain:

A sequence of random variables 푋0, 푋1, 푋2, … is a finite Markov chain with finite state space Ω and transition matrix 푃 if for all 푥, 푦 ∈ Ω, all 푡 ≥ 1, and all events $H_{t-1} = \bigcap_{s=0}^{t-1} \{X_s = x_s\}$ satisfying $\mathbf{P}(H_{t-1} \cap \{X_t = x\}) > 0$, we have:

$\mathbf{P}\{X_{t+1} = y \mid H_{t-1} \cap \{X_t = x\}\} = \mathbf{P}\{X_{t+1} = y \mid X_t = x\} = P(x, y)$ (C1)

Equation (C1) illustrates how the Markov chain explores the space in a local fashion. Often called the Markov local property, equation (C1) means that the conditional probability of proceeding from state 푥 to state 푦 is the same, no matter what sequence 푥0, 푥1, … 푥푡−1 of states precedes the current state 푥. This is exactly why the |Ω| × |Ω| matrix 푃 suffices to describe the transitions.

Let the distribution 푃(푥,·) be the 푥-th row of the transition matrix 푃. Thus, 푃 is stochastic, that is, its entries are all non–negative and:

$\sum_{y \in \Omega} P(x, y) = 1$ (C2)

Let (푋1, 푋2, … ) be a finite Markov chain with state space Ω and transition matrix 푃, and let the row vector 휇푡 be the distribution of 푋푡:

휇푡(푥) = 퐏{푋푡 = 푥} for all 푥 ∈ Ω

By conditioning on the possible predecessors of the (푡 + 1)st state, we see that:

$\mu_{t+1}(y) = \sum_{x \in \Omega} \mathbf{P}\{X_t = x\} \cdot P(x, y) = \sum_{x \in \Omega} \mu_t(x) \cdot P(x, y)$ for all $y \in \Omega$

Rewriting this in vector form gives:


$\mu_{t+1} = \mu_t \cdot P$ for $t \ge 0$ (C3)
hence:
$\mu_t = \mu_0 \cdot P^t$ for $t \ge 0$ (C4)
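
A minimal numerical sketch of equations (C3)–(C4), assuming numpy; the 3-state transition matrix below is invented for illustration. A row vector is advanced one step of the chain by right multiplication with 푃, exactly as in (C3).

import numpy as np

# Invented 3-state transition matrix; each row sums to 1, as required by (C2)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu = np.array([1.0, 0.0, 0.0])   # mu_0: start deterministically in state 0

for t in range(50):
    mu = mu @ P                  # mu_{t+1} = mu_t * P, equation (C3)

print(mu)                                                          # distribution of X_50
print(np.array([1.0, 0.0, 0.0]) @ np.linalg.matrix_power(P, 50))   # same result via (C4)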

 Irreducible finite Markov chain:

A Markov chain with transition matrix 푃 is called irreducible if for any two states 푥, 푦 ∈ Ω, there exists an integer 푡 (possibly depending on 푥 and 푦) such that 푃푡(푥, 푦) > 0.

This means that it is possible to get from any state to any other state not necessarily in one step and using only transitions of positive probability.

Lemma C.1: A finite irreducible Markov chain with state space Ω and transition matrix 푃 = (푝푖푗)1≤푖,푗≤푚 is aperiodic if there exists a state 푥푗 ∈ Ω, where 1 ≤ 푗 ≤ 푚, such that 푝푗푗 > 0.

 Periodic finite Markov chain:

Let 푇(푥) = {푡 ≥ 1 ∶ 푃푡(푥, 푥) > 0} be the set of times when it is possible for the chain to return to starting position 푥. The period of state 푥 is defined to be the greatest common divisor of 푇(푥). The chain is called aperiodic if every state has period 1, and periodic otherwise.

 Stationary distribution:

A stationary distribution of a Markov chain 푃 is a probability 휋 on Ω invariant under right multiplication by 푃, which means:

휋 = 휋 ∙ 푃 (C5)

In this case, 휋 is the long–term limiting distribution of the Markov chain. Clearly, if 휋 is a stationary distribution and 휇0 = 휋, i.e., the chain is started in a stationary distribution, then 휇푡 = 휋 for all 푡 ≥ 0. Equation (C5) can be rewritten element–wise as:

$\pi(y) = \sum_{x \in \Omega} \pi(x) \cdot P(x, y)$ for all $y \in \Omega$ (C6)

Under mild restrictions, stationary distributions of finite Markov chains exist and are unique; moreover, irreducible aperiodic finite Markov chains converge to their stationary distributions (see Theorems C.3, C.4, and C.7 below).

There is a difference between multiplying a row vector by 푃 on the right and a column vector by 푃 on the left: the former advances a distribution by one step of the chain, while the latter gives the expectation of a function on states, one step of the chain later.

 Hitting and stopping time:

For x ∈ Ω, we define the hitting time for 푥 to be the first time at which the chain visits state 푥:


휏푥 = min{푡 ≥ 0 ∶ 푋푡 = 푥} (C7)

For situations where only a visit to 푥 at a positive time will do, we also define:

$\tau_x^{+} = \min\{t \ge 1 : X_t = x\}$ (C8)

We call $\tau_x^{+}$ the first return time when $X_0 = x$.

A stopping time 휏 for (푋푡) is a {0, 1, …} ∪ {∞}–valued random variable such that, for each 푡, the event {휏 = 푡} is determined by 푋0, 푋1, … 푋푡. If 휏 is a stopping time, then an immediate consequence of the definition and the Markov property is:

$P_{x_0}\{(X_{\tau+1}, X_{\tau+2}, \dots, X_l) \in A \mid \tau = k \text{ and } (X_1, \dots, X_k) = (x_1, \dots, x_k)\} = P_{x_k}\{(X_1, \dots, X_l) \in A\}$ (C9)
for any $A \subset \Omega^l$. This is referred to as the strong Markov property. Informally, we say that the chain "starts afresh" at a stopping time.

Lemma C.2: For any states 푥 and 푦 of an irreducible chain:

$\mathbf{E}_x(\tau_y^{+}) < \infty$ (C10)

 Theorem C.3 (Existence of a stationary distribution):

Let 푃 be the transition matrix of an irreducible Markov chain. Then the following are true: there exists a probability distribution 휋 on Ω such that:

휋 = 휋 ∙ 푃 and 휋(푥) > 0 for all 푥 ∈ Ω (C11)

$\pi(x) = \frac{1}{\mathbf{E}_x(\tau_x^{+})}$ (C12)

 Theorem C.4 (Uniqueness of the stationary distribution):

Let 푃 be the transition matrix of an irreducible Markov chain. There exists a unique probability distribution 휋 satisfying: 휋 = 휋 ∙ 푃.
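
As a hedged illustration of Theorems C.3 and C.4 (assuming numpy and the invented 3-state chain from the sketch above), the stationary distribution can be computed as the left eigenvector of 푃 associated with eigenvalue 1; it has strictly positive entries and satisfies 휋 = 휋 ∙ 푃.

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Left eigenvectors of P are right eigenvectors of P transposed
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))   # index of the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                     # normalize to a probability distribution

print(pi)                              # stationary distribution, all entries positive (C11)
print(np.allclose(pi @ P, pi))         # True: pi = pi * P, equation (C5)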

 Reversibility and Time Reversals:

Suppose a probability 휋 on Ω satisfies the detailed balance equation:

휋(푥) ∙ 푃(푥, 푦) = 휋(푦) ∙ 푃(푦, 푥) for all 푥, 푦 ∈ Ω (C13)

Proposition C.5: Let 푃 be the transition matrix of a Markov chain with state space Ω. Any distribution 휋 satisfying the detailed balance equations is stationary for 푃.


Checking detailed balance is often the simplest way to verify that a particular distribution is stationary. Furthermore, when the detailed balance equation holds:

휋(푥0) ∙ 푃(푥0, 푥1) ∙ … ∙ 푃(푥푛−1, 푥푛) = 휋(푥푛) ∙ 푃(푥푛, 푥푛−1) ∙ … ∙ 푃(푥1, 푥0) (C14)

We can rewrite the previous equation in the following suggestive form:

푃휋{X0 = x0, … , X푛 = x푛} = 푃휋{X0 = x푛, … , X푛 = x0} (C15)

In other words, if a chain (푋푡) satisfies the detailed balance equation and has stationary initial distribution, then the distribution of (푋0, 푋1, … 푋푛) is the same as the distribution of (푋푛, 푋푛−1, … 푋0).
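
As noted above, checking detailed balance is often the simplest way to verify stationarity. Below is a small sketch of this verification strategy (assuming numpy; the lazy random walk on a three-vertex path and its candidate distribution are invented for illustration): checking the detailed balance equation (C13) certifies, by Proposition C.5, that the candidate distribution is stationary.

import numpy as np

# Lazy random walk on a path with three vertices (invented example)
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

pi = np.array([0.25, 0.50, 0.25])   # candidate distribution, proportional to the vertex degrees

# Detailed balance (C13): pi(x) * P(x, y) == pi(y) * P(y, x) for all x, y
flows = pi[:, None] * P             # flows[x, y] = pi(x) * P(x, y)
print(np.allclose(flows, flows.T))  # True, so by Proposition C.5 pi is stationary
print(np.allclose(pi @ P, pi))      # True: direct check of pi = pi * P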

 Reversible finite Markov chain:

A chain satisfying the detailed balance equation is called reversible. The time reversal of an irreducible Markov chain with transition matrix 푃 and stationary distribution 휋 is the chain with matrix:

$\hat{P}(x, y) = \frac{\pi(y) \cdot P(y, x)}{\pi(x)}$ (C16)

The stationary equation 휋 = 휋 ∙ 푃 implies that 푃̂ is a stochastic matrix. We write (푋̂푡) for the time–reversed chain of (푋푡) and 푃̂ for the transition matrix of (푋̂푡).

Proposition C.6: Let (푋푡) be an irreducible Markov chain with transition matrix 푃 and stationary distribution 휋. Then 휋 is stationary for 푃̂ and for any 푥0, 푥1, … 푥푡∈ Ω we have:

푃휋{X0 = x0, … , X푡 = x푡} = 푃휋{X̂0 = x푡, … , X̂푡 = x0} (C17)

Observe that if a chain with transition matrix 푃 is reversible, then: 푃̂ = 푃.

 Theorem C.7 (Markov Chain Convergence):

Suppose that a Markov chain 푃 is irreducible and aperiodic, with stationary distribution 휋. Then there exist constants 훼 ∈ (0, 1) and 퐶 > 0 such that:

$\max_{x \in \Omega} \| P^t(x, \cdot) - \pi \|_{TV} \le C \cdot \alpha^t$ (C18)
where $\| \mu - \nu \|_{TV}$ represents the total variation distance between two probability distributions 휇 and 휈 on Ω and is defined as:

$\| \mu - \nu \|_{TV} = \max_{A \subset \Omega} | \mu(A) - \nu(A) |$ (C19)


This theorem implies that the “long–term” fractions of time a finite irreducible aperiodic Markov chain spends in each state coincide with the chain’s stationary distribution.
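
A brief numerical illustration of Theorem C.7 (assuming numpy and the invented 3-state chain used in the earlier sketches): the maximal total variation distance (C19) between $P^t(x, \cdot)$ and 휋 decays geometrically in 푡.

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Stationary distribution via the left eigenvector with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

def tv_distance(mu, nu):
    # For a finite state space, the total variation distance (C19) equals half the L1 distance
    return 0.5 * np.abs(mu - nu).sum()

for t in [1, 5, 10, 20]:
    Pt = np.linalg.matrix_power(P, t)
    print(t, max(tv_distance(Pt[x], pi) for x in range(3)))   # decreases geometrically with t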

 Ergodicity:

A Markov chain is said to be ergodic if there exists a positive integer 푇0 such that, for all pairs of states 푖 and 푗 in the Markov chain, if the chain is started at time 0 in state 푖, then for all 푡 > 푇0, the probability of being in state 푗 at time 푡 is greater than 0.

For a Markov chain to be ergodic two technical conditions are required of its states and its transition matrix: irreducibility and aperiodicity. Informally, irreducibility ensures that there is a sequence of transitions of non–zero probability from any state to any other, while aperiodicity ensures that the states are not partitioned into sets such that all state transitions occur cyclically from one set to another.

 Theorem C.8 (Ergodic Theorem):

Let 푓 be a real–valued function defined on Ω. If (푋푡) is an irreducible Markov chain with stationary distribution 휋, then for any starting distribution 휇, the following holds:

$P_{\mu}\left\{ \lim_{t \to \infty} \frac{1}{t} \sum_{s=0}^{t-1} f(X_s) = \mathbf{E}_{\pi}[f] \right\} = 1$ (C20)
where $\mathbf{E}_{\pi}[f]$ is computed with the discrete case of formula (B11).

 Markov chain Monte Carlo (MCMC):

Problem: Given an irreducible transition matrix 푃, there is a unique stationary distribution 휋 on Ω satisfying 휋 = 휋 ∙ 푃.

Answer: The existence and uniqueness of the solution of this problem are ensured by Theorem C.3 and Theorem C.4, respectively.

Inverse problem: Given a probability distribution 휋 on Ω, can we find a transition matrix 푃 for which 휋 is its stationary distribution?

Answer: Yes, we can. The solution involves a method of sampling from a given probability distribution called Markov chain Monte Carlo.

MCMC uses Markov chains to sample. A random sample from a finite set Ω means a random uniform selection from Ω, i.e., one selection such that each element has the same chance 1⁄|Ω| of being chosen.


Suppose 휋 is a probability distribution on Ω. If a Markov chain (푋푡) with stationary distribution 휋 can be constructed, then, for 푡 large enough, the distribution of (푋푡) is close to 휋.

 Metropolis chains:

Problem: Given a probability distribution 휋 on Ω and some Markov chain with state space Ω and an arbitrary stationary distribution, can the chain be modified so that the new chain has the stationary distribution 휋?

Answer: Yes, it can. The Metropolis algorithm solves this problem.

We distinguish two cases: symmetric base chain and general base chain. In both cases we are given an arbitrary probability distribution 휋 on Ω and a base Markov chain (푋푡) with transition matrix ψ. We want to construct a new chain (푌푡), starting from the base chain (푋푡) and modifying its transitions, such that the stationary distribution of the new chain is 휋. The new chain is called the Metropolis chain. The transition matrix ψ of the base chain is also referred to as the proposal distribution.

 Let (푋푡) have a symmetric transition matrix ψ. This implies that (푋푡) is reversible with respect to the uniform distribution on Ω. The Metropolis chain is executed as follows.

It starts from the initial state of the base chain and evolves as follows: when at state 푥, a candidate state 푦 is generated from the distribution ψ(푥,·). The state 푦 is “accepted” with probability 푎(푥, 푦), which means that the next state of the new chain is 푦, or the state 푦 is “rejected” with probability 1 − 푎(푥, 푦), which means that the next state of the new chain remains at 푥. The acceptance probability 푎(푥, 푦) is:

$a(x, y) = 1 \wedge \frac{\pi(y)}{\pi(x)} = \min\left(1, \frac{\pi(y)}{\pi(x)}\right)$ (C21)

Therefore, the Metropolis chain for a probability 휋 and a symmetric transition matrix ψ is defined by the following transition matrix:

$P(x, y) = \begin{cases} \psi(x, y) \cdot \left[1 \wedge \frac{\pi(y)}{\pi(x)}\right], & \text{if } y \ne x \\ 1 - \sum_{z \ne x} \psi(x, z) \cdot \left[1 \wedge \frac{\pi(z)}{\pi(x)}\right], & \text{if } y = x \end{cases}$ (C22)


A very important feature of the Metropolis chain is that it only depends on the ratios $\pi(y)/\pi(x)$. Frequently $\pi(x)$ has the form $\pi(x) = h(x)/Z$, where the function $h : \Omega \to [0, \infty)$ is known and $Z = \sum_{x \in \Omega} h(x)$ is a normalizing constant. It may be difficult to compute 푍 explicitly, especially if Ω is large. Because the Metropolis chain only depends on the ratios $h(y)/h(x)$, it is not necessary to compute the constant 푍 in order to simulate the chain.
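
The following Python sketch (a hedged illustration under invented details, not taken from the text) runs a Metropolis chain for a target known only through an unnormalized function ℎ, using a symmetric random-walk proposal on the cycle Ω = {0, 1, …, K−1}; the normalizing constant 푍 is never computed.

import numpy as np

rng = np.random.default_rng(0)
K = 10

def h(x):
    # Unnormalized target: pi(x) = h(x) / Z, with Z never computed
    return np.exp(-0.5 * (x - 4.5) ** 2)

def metropolis_step(x):
    y = (x + rng.choice([-1, 1])) % K   # symmetric random-walk proposal on the cycle
    a = min(1.0, h(y) / h(x))           # acceptance probability (C21): only the ratio h(y)/h(x) is needed
    return y if rng.random() < a else x

# Simulate the chain and compare empirical state frequencies with the target pi
x, counts = 0, np.zeros(K)
for _ in range(200_000):
    x = metropolis_step(x)
    counts[x] += 1

states = np.arange(K)
print(counts / counts.sum())        # empirical distribution of the chain
print(h(states) / h(states).sum())  # target distribution pi = h / Z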

• Let (푋푡) have a general transition matrix ψ; this means ψ corresponds to an irreducible chain (푋푡).

The Metropolis chain is executed as follows. It starts from the initial state of the base chain and evolves as follows: when at state 푥, generate a state 푦 from the distribution ψ(푥,·). Then move to 푦 with probability 푎(푥, 푦) and remain at 푥 with the probability 1 − 푎(푥, 푦). The acceptance probability 푎(푥, 푦) is:

$a(x, y) = 1 \wedge \frac{\pi(y) \cdot \psi(y, x)}{\pi(x) \cdot \psi(x, y)} = \min\left(1, \frac{\pi(y) \cdot \psi(y, x)}{\pi(x) \cdot \psi(x, y)}\right)$ (C23)

Therefore, the Metropolis chain for a probability 휋 and a general transition matrix ψ is defined by the following transition matrix:

$P(x, y) = \begin{cases} \psi(x, y) \cdot \left[1 \wedge \frac{\pi(y) \cdot \psi(y, x)}{\pi(x) \cdot \psi(x, y)}\right], & \text{if } y \ne x \\ 1 - \sum_{z \ne x} \psi(x, z) \cdot \left[1 \wedge \frac{\pi(z) \cdot \psi(z, x)}{\pi(x) \cdot \psi(x, z)}\right], & \text{if } y = x \end{cases}$ (C24)

The transition matrix 푃 defines a reversible Markov chain with stationary distribution 휋.

 Glauber dynamics (Gibbs sampler):

In general, let 푉 and 푆 be finite sets and Ω be a subset of 푆푉: Ω ⊆ 푆푉. 푉 can be seen as the vertex set of a graph and 푆 can be seen as the set of state values for any vertex in the graph. The elements of 푆푉 are called configurations and can be “visualized” as labeling the vertices of 푉 with elements of 푆.

Problem: Given a probability distribution 휋 on a space of configurations Ω ⊆ 푆푉, can we find a Markov chain for which 휋 is its stationary distribution?

Answer: Yes, we can. The Glauber dynamics algorithm solves this problem.


Let 휋 be a probability distribution whose support is Ω. For a configuration 휎 ∈ Ω and a vertex 푣 ∈ 푉 let Ω(휎, 푣) be the set of configurations agreeing with 휎 everywhere except possibly at 푣:

Ω(휎, 푣) = {휏 ∈ Ω ∶ 휏(푤) = 휎(푤) for all 푤 ∈ 푉, 푤 ≠ 푣} (C25)

The (single–site) Glauber dynamics for 휋 is a reversible Markov chain with state or configuration space Ω, stationary distribution 휋, and transition probabilities defined by the distribution 휋 conditioned on the set Ω(휎, 푣) as follows:

$\pi_{\sigma,v}(\tau) = \pi(\tau \mid \Omega(\sigma, v)) = \begin{cases} \frac{\pi(\tau)}{\pi(\Omega(\sigma, v))}, & \text{if } \tau \in \Omega(\sigma, v) \\ 0, & \text{if } \tau \notin \Omega(\sigma, v) \end{cases}$ (C26)

In words, the Glauber chain moves from a configuration 휎 ≡ 푋푡 to a configuration 휏 ≡ 푋푡+1 as follows:

• a vertex 푣 is chosen uniformly at random from 푉;
• a new configuration 휏 ∈ Ω is chosen according to the probability measure 휋 conditioned on the set of configurations that agree with 휎 everywhere except possibly at 푣.
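
To make the single-site dynamics just described concrete, here is a hedged Python sketch (weights, biases and sizes are invented; numpy is assumed) of the Glauber update for a pairwise binary model of Boltzmann-machine type, 휋(휎) ∝ exp(Σ_{i<j} w_ij σ_i σ_j + Σ_i b_i σ_i) with σ_i ∈ {0, 1}. For such a model, conditioning 휋 on Ω(휎, 푣) reduces to resampling the chosen unit from a logistic function of its local field.

import numpy as np

rng = np.random.default_rng(1)
n = 5

# Invented symmetric weight matrix (zero diagonal) and biases of a small binary pairwise model
W = np.triu(rng.normal(scale=0.5, size=(n, n)), 1)
W = W + W.T
b = rng.normal(scale=0.1, size=n)

def glauber_step(sigma):
    # One single-site update: pick a unit uniformly at random, then resample it from pi conditioned on the rest
    v = rng.integers(n)
    field = W[v] @ sigma + b[v]              # local field of unit v (W[v, v] = 0, so sigma[v] does not contribute)
    p_on = 1.0 / (1.0 + np.exp(-field))      # pi(sigma_v = 1 | rest of the configuration)
    sigma = sigma.copy()
    sigma[v] = 1 if rng.random() < p_on else 0
    return sigma

sigma = rng.integers(0, 2, size=n)           # arbitrary initial configuration in {0, 1}^n
for _ in range(10_000):
    sigma = glauber_step(sigma)
print(sigma)                                 # a configuration approximately distributed according to pi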

 Comparing Glauber dynamics and Metropolis chains:

Suppose that 휋 is a probability distribution on the state space 푆푉, where 푆 is a finite set and 푉 is the vertex set of a graph. On one hand, we can always define the Glauber chain as just described. On the other hand, suppose that we have a chain which picks a vertex 푣 at random and has some mechanism for updating its configuration 휎 at 푣. This chain may not have stationary distribution 휋, but it can be modified by the Metropolis rule to obtain a Metropolis chain with stationary distribution 휋. The Metropolis chain obtained in this way can be very similar to the Glauber chain, but may not coincide exactly.
