
Learning Hyperparameters for Neural Network Models

Using Hamiltonian Dynamics

by

Kiam Choo

A thesis submitted in conformity with the requirements

for the degree of Master of Science

Graduate Department of Computer Science

University of Toronto

© Copyright by Kiam Choo

Abstract

Learning Hyperparameters for Neural Network Models Using Hamiltonian Dynamics

Kiam Choo

Master of Science

Graduate Department of Computer Science

University of Toronto

We consider a feedforward neural network model with hyperparameters controlling groups of weights. Given some training data, the posterior distribution of the weights and the hyperparameters can be obtained by alternately updating the weights with hybrid Monte Carlo and sampling the hyperparameters using Gibbs sampling. However, this method becomes slow for networks with large hidden layers. We address this problem by incorporating the hyperparameters into the hybrid Monte Carlo update. However, the region of state space under the posterior with large hyperparameters is huge and has low probability density, while the region with small hyperparameters is very small and has very high density. As hybrid Monte Carlo inherently does not move well between such regions, we reparameterize the weights to make the two regions more compatible, only to be hampered by the resulting inability to compute good stepsizes. No definite improvement results from our efforts, but we diagnose the reasons for that and suggest future directions of research.

Dedication

I dedicate this thesis to my family, who have accepted my wanderings over the years. I especially dedicate this to my mother, whose strength and will I carry on.

Acknowledgements

I thank Prof. Radford Neal for his invaluable guidance on this thesis. I also thank Faisal Qureshi for his helpful suggestions and friendship during this thesis. Thanks also to the other inhabitants of the Artificial Intelligence Laboratory who have helped me in some way to complete this thesis.

Contents

Introduction
Overview
The Neural Network Learning Problem
Bayesian Approach to Neural Net Learning
A Simple Example
Making Predictions
Determining the Hyperparameters From the Data
Motivation

The Hybrid Monte Carlo Method
Background on Markov Chain Monte Carlo Sampling
The Metropolis Algorithm with Simple Proposals
The Hybrid Monte Carlo Method
Leapfrog Proposals
Stepsize Selection
Convergence of Hybrid Monte Carlo

Hyperparameter Updates Using Gibbs Sampling
Neural Network Architecture
Neural Network Model of the Data
Posterior Distributions of the Parameters and the Hyperparameters
Sampling From the Posterior Distributions of the Parameters and the Hyperparameters
Convergence of the Algorithm
Inefficiency Due to Gibbs Sampling of Hyperparameters

Hyperparameter Updates Using Hamiltonian Dynamics
The New Scheme
The Idea
The New Scheme in Detail
Reparameterization of the Hyperparameters
Reparameterization of the Weights
First Derivatives of the Potential Energy
First Derivatives with Respect to the Parameters
First Derivatives with Respect to the Hyperparameters
Approximations to the Second Derivatives of the Potential Energy
Second Derivatives with Respect to the Parameters
Second Derivatives with Respect to the Hyperparameters
Summary of Compute Times
Computation of Stepsizes
Compute Times for the Dynamical A Method

Results
Training Data
Verification of the New Methods
Results From Old Method
Results of New Methods Compared with the Old
Methodology for Evaluating Performance
The Variance of Means Measurement of Performance
Error Estimation for Variance of Means
Geometric Mean of Variance of Means
Iterations Allowed for Each Method
Markov Chain Start States
Master Runs
Starting States Used
Modified Performance Measures Due to Stratification
Number of Leapfrog Steps Allowed
Results of Performance Evaluation
Pairwise Bootstrap Comparison

Discussion
Has the Reparameterization of the Network Weights Been Useful?
Making the Dynamical B Method Go Faster
Explanation for the Rising Rejection Rates
The Appropriateness of Stepsize Heuristics
Different Settings for h and p
Fine Splitting of Hyperparameter Updates
Why the Stepsize Heuristics are Bad
Other Implications of the Current Heuristics

Conclusion

A Preservation of Phase Space Volume Under Hamiltonian Dynamics

B Proof of Theorem: Deterministic Proposals for the Metropolis Algorithm

C Preservation of Phase Space Volume Under Leapfrog Updates

Bibliography

Chapter 1

Introduction

Overview

A feedforward neural network is a nonlinear model that maps an input to an output. It can be viewed as a nonparametric model in the sense that its parameters cannot easily be interpreted to provide insight into the problem that it is being used for. Nevertheless, feedforward neural networks are powerful: with sufficient hidden units, they can learn to approximate any nonlinear mapping arbitrarily closely (Cybenko, 1989). Partly because of this flexibility, they have become widespread tools used by many practitioners in the sciences and engineering. These practitioners typically use well-established learning techniques like backpropagation (Rumelhart et al., 1986) or its variants. But despite the multitude of learning methods already in existence, learning for feedforward networks remains an area of active research.

A recent approach to feedforward neural net learning is Bayesian learning (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1996; Muller and Insua, 1998). This new approach can be viewed as a response to the problem of incorporating prior knowledge into neural networks. However, the computational problems in Bayesian learning are complex, and none of the existing techniques are perfect. In the interests of computational


tractability, both the works of MacKay (1992) and Buntine and Weigend (1991) assume Gaussian approximations to the posterior distribution over network weights.

A more general and flexible approach is to sample from the posterior distribution of the weights, as has been done by Neal (1996) and Muller and Insua (1998). Neal obtains samples by alternating hybrid Monte Carlo updates of the weights with Gibbs sampling updates of the hyperparameters. Muller and Insua also alternately update the weights and Gibbs-sample the hyperparameters, but in addition they observe that, given all weights except for the hidden-to-output ones, the posterior distribution of the latter is simply Gaussian when the data noise is Gaussian. While the other weights still need to be updated by a more complicated Metropolis step, this does allow them to sample directly from the Gaussian distribution of the hidden-to-output weights. However, as will be described later, both methods are expected to become slow for large networks, possibly to the point where they become unusable.

This thesis addresses the above inefficiency for large networks. Specifically, it is concerned with improving on the hybrid Monte Carlo technique used by Neal, so that both parameters and hyperparameters are updated using hybrid Monte Carlo.

The Neural Network Learning Problem

The rest of this thesis is about feedforward neural networks only, so we drop the "feedforward" for simplicity.

In this section we define the neural network learning problem that underlies this thesis.

Given a set of inputs X = \{x^c\} and targets Y = \{y^c\}, for c = 1, \ldots, N_c, a neural network can be used to model the relationship between them, so that

    y^c \approx f(x^c, W)

where f(\cdot, W) is the function computed by the neural network with weights W. This


modeling is achieved by training the weights W using the training data, consisting of the inputs X and the targets Y. Once training is complete, the neural net can be used to predict targets given previously unseen values of inputs.

Conventionally, the learning process is viewed as an optimization problem, where the weights are learned using some kind of gradient descent method on an error function such as the following:

    E(W) = \frac{1}{2} \sum_{c=1}^{N_c} \left| f(x^c, W) - y^c \right|^2

The result of this procedure is a single optimal set of weights, W_{\mathrm{opt}}, that minimizes the error. This single set of weights is then used for future predictions from a new input. The conventionally-trained network prediction is thus

    f_C(x) = f(x, W_{\mathrm{opt}})
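The optimization view above can be sketched in a few lines of code. The following toy example minimizes the squared-error function E(W) by plain gradient descent; the one-weight "network" f(x, w) = wx, the data, and the learning rate are all hypothetical choices for illustration, not part of the thesis.

```python
# Gradient descent on E(W) = (1/2) * sum_c (f(x_c, W) - y_c)^2,
# for the simplest possible "network": f(x, w) = w * x (one weight).

def f(x, w):
    return w * x

def error(w, xs, ys):
    # The squared-error function E(W) from the text.
    return 0.5 * sum((f(x, w) - y) ** 2 for x, y in zip(xs, ys))

def train(xs, ys, w0=0.0, rate=0.05, steps=200):
    w = w0
    for _ in range(steps):
        # dE/dw = sum_c (f(x_c, w) - y_c) * x_c
        grad = sum((f(x, w) - y) * x for x, y in zip(xs, ys))
        w -= rate * grad
    return w

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 2.1, 3.9, 6.0]   # roughly y = 2x, made-up data
w_opt = train(xs, ys)
```

The result, w_opt, plays the role of W_opt above: a single point estimate that all future predictions are made from.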

Bayesian Approach to Neural Net Learning

The Bayesian approach to neural network learning differs fundamentally from the conventional optimization approach in that, rather than obtaining a single best set of weights from the training process, a probability distribution over the weights is obtained instead.

Bayesian Inference

Generally speaking, Bayesian inference is a way by which unknown properties of a system may be inferred from observations. In the Bayesian inference framework, we model the observations z as being generated by some model with unobserved parameters \theta. Specifically, we come up with a likelihood function p(z | \theta), the probability of the observable state z given a particular setting of the parameters \theta. Next, we decide on p(\theta), the prior probability distribution over parameters.


With these two functions in hand, we use Bayes' rule to infer the posterior probability that the parameters have the value \theta when we observe the state z:

    p(\theta | z) = \frac{p(\theta) \, p(z | \theta)}{p(z)}

Bayesian learning can be applied to neural networks in the following way. We model the targets as the neural network output f(x, W) plus some noise, which defines the likelihood p(Y | W, X), and we assume some form for the prior distribution of the weights, p(W). The posterior distribution of the weights is then

    p(W | X, Y) = \frac{p(W) \, p(Y | W, X)}{p(Y | X)}

where we have set p(W | X) = p(W), since the prior distribution of the weights does not depend on the inputs. The posterior p(W | X, Y) above is the probability distribution over weights that we infer in the Bayesian framework.

A Simple Example

As a simple example, assuming that the noise in the output of each unit is Gaussian with fixed standard deviation \sigma, we get, for a net with N_y outputs,

    p(Y | W, X) = \left( 2\pi\sigma^2 \right)^{-N_c N_y / 2} \exp\left( -\frac{1}{2\sigma^2} \sum_{c=1}^{N_c} \left| f(x^c, W) - y^c \right|^2 \right)

And assuming a simple prior where all the weights W = \{w_i\}, i = 1, \ldots, N_w, have a Gaussian distribution of fixed inverse variance \gamma:

    p(W) = \left( \frac{\gamma}{2\pi} \right)^{N_w / 2} \exp\left( -\frac{\gamma}{2} \sum_{i=1}^{N_w} w_i^2 \right)

This gives the posterior distribution for W:

    p(W | X, Y) \propto p(W) \, p(Y | W, X)
                \propto \exp\left( -\frac{\gamma}{2} \sum_{i=1}^{N_w} w_i^2 - \frac{1}{2\sigma^2} \sum_{c=1}^{N_c} \left| f(x^c, W) - y^c \right|^2 \right)


Here the symbol \propto denotes proportionality. We have dropped the normalizing constant p(Y | X), as well as factors not dependent on the weights. This is because we are considering the posterior distribution over the weights only, in this simple model with \sigma and \gamma fixed.
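The simple example above can be made concrete by evaluating the log of the unnormalized posterior: the log prior term plus the log likelihood term, with constants dropped. The following sketch does this for a hypothetical toy network; the names `net`, `sigma`, and `gamma`, and the data, are illustrative assumptions, not from the thesis.

```python
# Unnormalized log posterior for the simple example:
#   log p(W | X, Y) = -(gamma/2) * sum_i w_i^2
#                     - (1/(2 sigma^2)) * sum_c (f(x_c, W) - y_c)^2 + const

def log_posterior(weights, xs, ys, net, sigma=0.5, gamma=1.0):
    log_prior = -0.5 * gamma * sum(w * w for w in weights)
    log_lik = -sum((net(x, weights) - y) ** 2
                   for x, y in zip(xs, ys)) / (2.0 * sigma ** 2)
    return log_prior + log_lik

def net(x, weights):
    # A stand-in for f(x, W): a tiny linear "network" with two weights.
    return weights[0] * x + weights[1]

xs, ys = [0.0, 1.0, 2.0], [0.2, 1.1, 1.9]
lp_good = log_posterior([1.0, 0.1], xs, ys, net)   # weights that fit well
lp_bad = log_posterior([5.0, 5.0], xs, ys, net)    # weights that fit badly
```

Well-fitting, moderate weights receive a higher (less negative) log posterior than large, badly fitting ones, which is exactly the trade-off between the prior and likelihood terms above.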

Making Predictions

The prediction of the net trained using Bayesian inference is obtained as the expected output over all possible weight settings, weighted by their posterior probabilities:

    f_B(x) = E_{W | X, Y}\left[ f(x, W) \right] = \int f(x, W) \, p(W | X, Y) \, dW

Compared to the conventional prediction, Bayesian prediction is clearly more complicated.

Determining the Hyp erparameters From the Data

The inverse variance \gamma of the weights is called a hyperparameter, because it is a parameter that controls the prior distribution of the parameters w_i. In practice, it is reasonable to let the hyperparameters be determined from the data. For instance, the input-to-hidden weights for one training set might need to be larger than for another training set because its outputs vary more rapidly. Evidently, it is possible to infer the hyperparameters from the training data.

But we infer the hyperparameters not just because it is possible, but because it is desirable as well. This is because it is difficult for a human operator to guess a good setting of the hyperparameters, but it is easier to guess a prior distribution for the hyperparameters, e.g. in terms of its mean and some measure of its spread. Moreover, allowing the precision \gamma to vary couples the weights in their prior distribution and allows for a richer prior, whereas all the weights would be independent in their prior if \gamma were fixed. Details of the incorporation of the hyperparameters into the sampling procedure will be given in later chapters.


Once we have the posterior distribution of the weights and the hyperparameters, the network has been trained. Predictions from the net now involve the joint posterior distribution of the weights and the hyperparameters:

    f_B(x) = E_{W, \gamma | X, Y}\left[ f(x, W) \right] = \int f(x, W) \, p(W, \gamma | X, Y) \, dW \, d\gamma

The above integral usually cannot be analytically obtained for neural network models.

Note that the variance of the data is often also regarded as a hyperparameter, because the role it plays in controlling the network error is similar to that played by other hyperparameters in controlling their respective weights.

Motivation

The practicality of the Bayesian framework hinges on the existence of computationally efficient ways to evaluate or approximate the predictive integral above. The main problem is that the posterior distribution p(W | X, Y) is often such that the integral cannot be performed analytically.

In the interests of computational feasibility, Buntine and Weigend (1991) and MacKay (1992) approximate the posterior distribution of the weights and hyperparameters as a Gaussian distribution. Unfortunately, it usually cannot be seen in advance from the training data whether a Gaussian distribution is a reasonable approximation to the posterior distribution. For instance, for a small network that has just enough hidden units to model some given data, we would expect that, ignoring multiple modes due to units swapping roles, the posterior distribution is peaked at a single mode, because each unit has a well-constrained role to play in the mapping. In such a case, one might reasonably expect the posterior to be approximately Gaussian. However, when there are more hidden units, units are no longer so constrained, and the posterior distribution will be broader, in ways that do not necessarily retain a Gaussian appearance.


The approach of Neal, and of Muller and Insua, to the problem is to sample from the posterior distribution using Markov chain Monte Carlo (MCMC) techniques. MCMC techniques do not approximate the posterior distribution as a Gaussian, but instead sample faithfully from its true form. With n samples from the posterior distribution, we can obtain the expected output of the net as

    \hat{f}_B(x) \approx \frac{1}{n} \sum_{i=1}^{n} f(x, W_i)
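The estimator above is just an average of network outputs over posterior samples of the weights. As a minimal sketch, the code below averages the same toy one-weight network used earlier over synthetic "posterior samples" drawn around w = 2; the sampling distribution is a made-up stand-in for a real MCMC output.

```python
import random

# Monte Carlo estimate f_B(x) ~ (1/n) * sum_i f(x, W_i):
# average the network output over samples of the weights.

def f(x, w):
    return w * x

def predict(x, weight_samples):
    return sum(f(x, w) for w in weight_samples) / len(weight_samples)

random.seed(0)
# Hypothetical "posterior samples" concentrated near w = 2.
samples = [random.gauss(2.0, 0.1) for _ in range(5000)]
y_hat = predict(1.5, samples)
```

With genuinely independent samples the estimate converges at the usual 1/sqrt(n) Monte Carlo rate; the point of the following discussion is that correlated MCMC samples converge much more slowly.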

In order for this method to be effective, each sample W_i must be as independent of the previous sample as possible. A common problem of MCMC techniques is that samples can be highly correlated, in which case, even though they are drawn from the correct distribution, they sample the distribution very slowly, and a huge number of samples might be needed for reliable estimates. In severe cases, the method becomes infeasible for practical use. The MCMC technique used by Neal faces this problem when the number of hidden units becomes large. His method alternates between using hybrid Monte Carlo to update the network parameters and Gibbs sampling to sample the hyperparameters. Unfortunately, it is this alternation between updating the parameters and the hyperparameters that causes high correlations from one sample to the next as the number of hidden units becomes large.

The root of this inefficiency is that, during each hybrid Monte Carlo process that yields one sample of the weights, the hyperparameters are held fixed. This would not be a problem if, to obtain the next sample of the weights, the hyperparameters could be shifted to an uncorrelated value. However, because the hyperparameters are updated using Gibbs sampling given the current values of the weights, they are pinned and unable to move much. The larger the number of hidden units, the greater the pinning effect is.

Muller and Insua's method suffers from the same inefficiency, as it also alternates Gibbs sampling of the hyperparameters with Markov chain updates of the network parameters.


This problem is the motivation for this thesis. In this thesis, we propose and investigate a modification to the hybrid Monte Carlo technique used by Neal. Specifically, rather than updating the weights by hybrid Monte Carlo and the hyperparameters by Gibbs sampling, we update both the weights and the hyperparameters using hybrid Monte Carlo. The idea is that, because both the weights and the hyperparameters are now changing at the same time, we no longer have the pinning effect, and hybrid Monte Carlo should then be able to produce samples that move much more efficiently through the posterior distribution.

Of course, one might ask why we would want to use large hidden layers in the first place. There are several reasons for this. Firstly, since a neural network is a nonparametric model, it makes sense when modelling some data to use a lot of hidden units in order to maximize the network's power of representation. That is, we want the function computed by the neural network to not be constrained by there being too few units, and to be determined instead by the data. Secondly, small numbers of hidden units often lead to local maxima in the posterior distribution of the weights, because the few available hidden units can get trapped into representing suboptimal features in the data, leaving no spare unused units to seek out the important features. Using a larger hidden layer tends to connect the multiple modes into ridges, and thus improves mobility. Finally, in his book, Neal (1996) has shown that the prior distribution of a neural network becomes tractable for infinite-sized hidden layers. So using many hidden units allows for a more precise specification of a neural network's prior distribution. Incidentally, overfitting is not a problem in the first justification given above, because Neal's results show how to assign appropriate priors for increasing network size.

This thesis is organized as follows. In an effort to make this a self-contained work, Chapter 2 lays down the background material on Markov chains and the hybrid Monte Carlo method that is necessary, and hopefully sufficient, to understand the rest of what follows. Chapter 3 describes the original method of Neal. Chapter 4 presents the new method that is the subject of this thesis. Results of the investigations of the new method are presented in Chapter 5, followed by the discussions and conclusions in the final chapter.

Chapter 2

The Hybrid Monte Carlo Method

In this chapter, we introduce the hybrid Monte Carlo method, which is the main method by which we sample from the posterior distribution of a neural network.

Background on Markov Chain Monte Carlo Sampling

In this section, we present some necessary background on Markov chains, at a level of technical detail sufficient to explain the rest of this work. More detailed presentations may be found elsewhere, such as Feller's book.

Definition (Markov chain): A Markov chain is a series of random variables X_0, X_1, X_2, etc., such that

    P(X_{i+1} | X_i, X_{i-1}, \ldots, X_0) = P(X_{i+1} | X_i)

That is, given X_i, X_{i+1} is independent of all earlier X's.

A Markov chain is defined by the state space S in which the X_i live, the distribution over the initial state P(X_0), and the transition probability function P(X_{i+1} | X_i).
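A minimal sketch of this definition: a two-state chain specified entirely by its transition probabilities. The particular transition matrix below is a made-up example, but it already exhibits the convergence property discussed next: the distribution of X_n settles to the same limit from either starting state.

```python
# A two-state Markov chain: state space {0, 1}, transition probabilities T.
T = {0: {0: 0.9, 1: 0.1},   # P(next state | current state = 0)
     1: {0: 0.3, 1: 0.7}}   # P(next state | current state = 1)

def step_distribution(dist):
    # Apply one transition to a distribution over states.
    return {j: sum(dist[i] * T[i][j] for i in T) for j in (0, 1)}

def run(dist, n):
    for _ in range(n):
        dist = step_distribution(dist)
    return dist

# Start from each of the two deterministic initial states.
from_0 = run({0: 1.0, 1: 0.0}, 200)
from_1 = run({0: 0.0, 1: 1.0}, 200)
```

For this matrix the invariant distribution solves pi(0) = 0.9 pi(0) + 0.3 pi(1), giving pi = (0.75, 0.25), and both runs converge to it.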


For us, the utility of Markov chains lies in the fact that, under the right conditions, they converge to some probability distribution regardless of starting state. In other words, independent of starting state, in the limit of large n, X_n will become a sample from a particular distribution Q(x). This allows us to obtain samples from posterior distributions that arise in probabilistic inference.

Below we present a theorem, obtained from Rosenthal, that tells us the conditions under which a Markov chain converges to a distribution. In this presentation, probability density functions will be used in two ways: with a state as an argument, or with a set as an argument. Thus p(x) will refer to the probability density at x, while p(A) will mean the total probability mass in the set A. First we need the following definitions. Let S be the state space of the Markov chain.

Definition (Multi-transition probability): For x \in S and A \subseteq S, we define the multi-transition probability T^n(x, A) as the probability of ending up in the set A after n transitions according to the Markov transition probabilities, given that we started at state x. T^n(x, A) is really P(X_n \in A | X_0 = x). For one transition, we also write T(x, A) rather than T^1(x, A).

Definition (Invariant distribution): \pi(x) is an invariant distribution of a Markov chain with transitions T(x, A) if, for all sets A \subseteq S,

    \pi(A) = \int \pi(dy) \, T(y, A)

where \pi(dy) is the probability mass in a set of infinitesimal volume at state y. We also say that the Markov chain leaves \pi(x) invariant.

Note that a Markov chain may not have an invariant distribution, and if it has one, it may not be unique.


Definition (Total variation distance): The total variation distance between two probability distributions p and q on S is given by

    \| p - q \| = \sup_{A \subseteq S} \left| p(A) - q(A) \right|

Suppose we start a Markov chain from state x \in S. Then, depending on the history of transitions it takes, it will take varying numbers of steps to enter a set A \subseteq S of nonzero volume, if it does so at all. Let \tau_A be the history-dependent random time that denotes the first time the Markov chain enters A, i.e., \tau_A = \inf\{ n : X_n \in A \}. Note that \tau_A could equal infinity. Then we have the following important definitions about the mixing properties of a Markov chain.

Definition (Irreducibility): A Markov chain is irreducible if, for any set A \subseteq S of nonzero volume, P_x(\tau_A < \infty) > 0 for all starting points x \in S. That is, any starting point x has some probability of going to any nonzero volume within a finite number of steps.

Definition (Aperiodicity): A Markov chain is aperiodic if there does not exist a partition of the state space S = S_1 \cup S_2 \cup \ldots \cup S_m, for some m \geq 2, such that T(x, S_{i+1}) = 1 for all x \in S_i with i = 1 to m-1, and T(x, S_1) = 1 for all x \in S_m.

i  m

If a Markov chain has an invariant distribution, and it is both irreducible and aperiodic, then it converges to that invariant distribution. This theorem, presented below without proof, is a slightly modified version of the one given by Rosenthal, who also proves it.

Theorem (Markov Chain Convergence): Let T(x, A) be the transition probabilities for an irreducible, aperiodic Markov chain having invariant distribution \pi(x) on a state space S. Then for all x \in S such that \pi(x) > 0,

    \lim_{n \to \infty} \| T^n(x, \cdot) - \pi \| = 0

That is, as the number of transitions goes to infinity, the total variation distance between the invariant distribution and the distribution of the Markov chain started from any state x such that \pi(x) > 0 goes to 0.

Using the above theorem, we can construct Markov chains that converge to a desired distribution by ensuring that the chain is aperiodic, irreducible, and has the target distribution as an invariant distribution. However, while the above theorem guarantees convergence in theory, it does not say anything about the speed with which convergence is achieved. This is important, as the initial portion of a Markov chain is typically not representative of the invariant distribution and needs to be discarded in order not to bias the results. Moreover, a badly-constructed Markov chain can converge far too slowly to be useful in practice. Nevertheless, having one that converges to the correct distribution, and knowing that it does, is a good start.

A Markov chain that is constructed to generate samples from some target distribution is known in the literature as a Markov chain Monte Carlo (MCMC) method. An example of an MCMC method that is commonly used for multivariate distributions is Gibbs sampling. Gibbs sampling consists of update steps where each variable is updated in turn. During each update, a variable is replaced by a sample from its target distribution conditional on all the other variables having their current values. Note that the new value of the variable is chosen without reference to the old value it replaces. This leaves the desired distribution invariant, because the resulting multivariate state is an outcome drawn according to the target distribution. Furthermore, because all values of a variable have nonzero probability of being generated, the method is irreducible and aperiodic, so long as all the variables get updated at some point.
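The update scheme just described can be sketched on a standard textbook target: a bivariate Gaussian with correlation rho, for which both conditional distributions are themselves Gaussian and easy to sample. This target and its settings are illustrative assumptions, not the neural-network posterior of this thesis (where, as noted below, such conditionals are unavailable).

```python
import random

# Gibbs sampling for a bivariate Gaussian with unit marginal variances and
# correlation rho: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y | x.
# Each variable is replaced in turn by a draw from its exact conditional.

def gibbs(n, rho=0.8, seed=1):
    random.seed(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x = y = 0.0
    samples = []
    for _ in range(n):
        x = random.gauss(rho * y, sd)   # update x given current y
        y = random.gauss(rho * x, sd)   # update y given current x
        samples.append((x, y))
    return samples

samples = gibbs(20000)
mean_x = sum(s[0] for s in samples) / len(samples)
```

Note that consecutive samples are correlated (here with lag-one correlation roughly rho^2), which already hints at the slow-mixing problems discussed later for strongly coupled variables.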

Although it is conceptually simple, Gibbs sampling requires that one is able to sample from the conditional distribution of each variable. For complicated distributions, like the neural network posteriors in this thesis, this is usually not possible. Other schemes exist that do not have this requirement. Below we present the Metropolis algorithm with simple proposals, which requires only that we be able to evaluate the target probability density at a given state, but whose weaknesses will motivate the more sophisticated hybrid Monte Carlo method used in this thesis.

The Metropolis Algorithm with Simple Proposals

The Metropolis algorithm (Metropolis et al., 1953) is a well-known algorithm for constructing a Markov chain with a desired invariant distribution.

Let \pi(X) be the desired invariant distribution. Suppose our Markov chain currently has state X_i. The Metropolis algorithm amounts to the following Markov chain transition rule. First, propose a transition to a new state X^* from the current state X_i, where the proposal probability density M(X_i, X^*) must be symmetric; that is,

    M(X_i, X^*) = M(X^*, X_i)

M(X_i, X^*) is the probability density of going to X^* given that we were originally at X_i. Next, accept the proposed state as the next Markov chain state X_{i+1} with the following probability:

    P(\text{accept}) = \min\left( 1, \frac{\pi(X^*)}{\pi(X_i)} \right)

If we reject, the state X_{i+1} is set to be the previous state X_i.

One can show that the Metropolis algorithm guarantees that the target distribution \pi(X) is an invariant distribution of the Markov chain. However, it does not guarantee that the Markov chain is irreducible and aperiodic.
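The transition rule above fits in a few lines of code. The sketch below targets a one-dimensional standard Gaussian (an illustrative target, not a neural-network posterior); only the density ratio \pi(X^*)/\pi(X_i) is needed, so normalizing constants cancel, and the comparison is done in log space for numerical safety.

```python
import random
import math

# Metropolis algorithm with a symmetric Gaussian proposal, targeting a
# one-dimensional standard Gaussian pi(x) (unnormalized log density -x^2/2).

def metropolis(n, step=1.0, seed=2):
    random.seed(seed)
    log_pi = lambda x: -0.5 * x * x      # log of the unnormalized target
    x = 0.0
    chain = []
    for _ in range(n):
        proposal = random.gauss(x, step)   # symmetric proposal M(x, x*)
        # Accept with probability min(1, pi(x*) / pi(x)).
        if math.log(random.random()) < log_pi(proposal) - log_pi(x):
            x = proposal                    # accepted
        chain.append(x)                     # on rejection, x is repeated
    return chain

chain = metropolis(50000)
```

Both the mean and the variance of the chain approach those of the target; the next section examines why this simple scheme nevertheless explores awkwardly shaped distributions slowly.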

Let us consider the performance of the Metropolis algorithm in sampling from some target distribution when we use a simple Gaussian proposal with a fixed covariance \Sigma, centred on the current point X_i:

    P(X^*) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (X^* - X_i)^T \Sigma^{-1} (X^* - X_i) \right)

where d is the dimensionality of the state space. This example will illuminate the key issues in sampling a distribution with the Metropolis algorithm.

If the variance of the Gaussian proposals is too large compared with the width of the target distribution, the Metropolis algorithm almost always rejects, as proposals usually end up in regions of low target probability. Clearly, this may lead to a slow exploration of the state space, and a proposal with a smaller variance and higher acceptance rate may be better. Indeed, with the exception of special cases like two-dimensional Gaussian target distributions, fairly high acceptance rates are better than very low acceptance rates. But in order to keep the acceptance rate high, the standard deviation of the proposal distribution must be of a size comparable to the distribution's thinnest cross-section, and so the steps taken must be very small compared to the overall distribution if the distribution is very thin in one direction but very long in others.

Thus the first problem is that the presence of a long, thin region in a distribution constrains such a scheme to take steps which may be very small compared to the size of the overall distribution. This by itself is not so bad if the direction of the next step is somehow correlated with that of the first. However, it is not, and that is the second problem: the next step is chosen independently of the first, and because it has the possibility of doubling back on the first step, a random walk results. This effect is illustrated in the figure below.

We expect that the posterior distribution of a neural network's weights is complicated under most circumstances, and might potentially have long, narrow regions. Thus, to sample from the posterior distribution of a neural network using the Metropolis algorithm with Gaussian proposals, we would need to use proposal distributions with small variances. This leads to inefficient random walks, as described above. A method that is more appropriate for the complicated posteriors seen in neural


[Figure: a wandering two-dimensional trajectory between marked "start" and "finish" points, on x and y axes spanning -1 to 1.]

Figure: Illustration of random walk behaviour when using the Metropolis algorithm with simple Gaussian proposals to explore a two-dimensional Gaussian distribution. A fixed proposal standard deviation was used, and a number of the proposed samples were rejected. Large steps are actually more efficient for the special case of a two-dimensional Gaussian target distribution; this figure serves as an illustration of random walks only.


network models is the hybrid Monte Carlo algorithm, which addresses the random walk problem by having auxiliary momentum variables that allow it to keep going in the same direction for many steps. This means that, in the case of hybrid Monte Carlo, the Metropolis rejection test is applied only after many steps, to give it a chance at travelling a long distance.

The Hybrid Monte Carlo Metho d

The hybrid Monte Carlo method, first used in physics by Duane et al. (1987), can be thought of as a Metropolis algorithm with a sophisticated proposal. In this section, we describe how the hybrid Monte Carlo method works. Neal has written relevant expositions of this technique, but we include it here for completeness. We will use the symbols C and C' to denote normalizing constants.

In hybrid Monte Carlo, we associate a physical system with the distribution that we want to sample from. In essence, we simulate the movement of a particle moving in a potential energy well equal to the negative log of the probability density for the distribution that we want to sample from. Each iteration consists of randomizing the velocity of this particle, simulating its motion for some time, and then obtaining its position, which becomes a new sample.

Suppose that we wish to sample from the distribution P(q), where q \in \mathbb{R}^d; \mathbb{R}^d is then the state space of our associated physical system, and q is a state of the system. We augment each state variable q_i with a momentum variable p_i, and we define the Hamiltonian

    H(q, p) = E(q) + K(p)

where the potential energy E(q) is obtained from the desired distribution as

    E(q) = -\log P(q) - \log Z

for any choice of Z, and the kinetic energy K(p) is defined with a set of masses \{m_i\}, i = 1, \ldots, d, as

    K(p) = \frac{1}{2} \sum_{i=1}^{d} \frac{p_i^2}{m_i}

Hybrid Monte Carlo allows us to set up a Markov chain that converges to C \exp(-E(q) - K(p)) as its unique invariant distribution. By ignoring the values of p, we obtain samples of q drawn from the target distribution P(q), since this is the marginal distribution.

In the hybrid Monte Carlo method, we simulate the time evolution of the physical system with the above Hamiltonian using Hamiltonian dynamics, which is given by

    \frac{dp_i}{dt} = -\frac{\partial E(q)}{\partial q_i}, \qquad \frac{dq_i}{dt} = \frac{p_i}{m_i}

Let us assume for now that we have the ability to do the simulation with perfect accuracy. Since Hamiltonian dynamics leaves H invariant and keeps phase space volume constant (see Appendix A), simulating the system over any fixed length of time yields a new pair (q, p) that leaves any distribution that is a function of H invariant. In particular, it leaves C \exp(-H(q, p)) = C \exp(-E(q) - K(p)) invariant.

However, a Markov chain that consists of only this update is not irreducible, as all points generated from a starting point never leave a hypershell of constant H, thus violating the irreducibility requirement of the convergence theorem. To rectify this situation, we update the momentum variables in such a way that the Markov chain has some chance of reaching all the other values of H after some number of iterations. Specifically, before the simulation of Hamiltonian dynamics, we replace all the momentum variables by new values drawn from the distribution C' \exp(-K(p)). Again, this update leaves the distribution C \exp(-E(q) - K(p)) invariant, since it draws p from the correct conditional distribution, which happens to be independent of q.

So the joint update, consisting of the momentum update followed by the Hamiltonian dynamics simulation, is a Markov chain transition that leaves C exp(−E(q) − K(p)) invariant. If we can construct such a Markov chain and prove that it is irreducible and aperiodic, then we have a Markov chain that converges to the desired distribution C exp(−E(q) − K(p)), and we can obtain the desired samples by ignoring p. Moreover, if during the Hamiltonian simulation we follow the dynamical trajectory of a state for a long time, we might obtain a state that is much less correlated with the original state than one produced by a Metropolis algorithm with simple proposals.

The method presented thus far is not the actual hybrid Monte Carlo algorithm, but it contains all the essential ideas. What is different about hybrid Monte Carlo is that in reality we are unable to simulate Hamiltonian dynamics perfectly. Because neural network models are highly complex, and so lead to non-integrable Hamiltonians, we have to settle for an approximate, discretized simulation of Hamiltonian dynamics, followed by a Metropolis rejection test that ensures that C exp(−H) is kept invariant. As before, the update consisting of momentum resampling followed by the simulation keeps the desired distribution C exp(−H) invariant. The conditions under which this Markov chain is irreducible and aperiodic depend on its details, and we delay discussing this until the section on convergence below.

Finally, we note that the discretized simulation is now also a Metropolis proposal, with the probability of rejection increasing as the simulation error, measured by the rise in H, increases. When we do the simulation well, we keep H almost invariant over long trajectories, so it is in our interest to do the simulation as well as we can in order to have a high acceptance rate, even though simulation errors are corrected by the Metropolis rejection test to give the exact desired distribution.

Leapfrog Proposals

Because the discretized simulation used as a Metropolis proposal is deterministic, the standard reversibility condition for Metropolis proposals does not apply. Instead, the equivalent reversibility conditions for deterministic proposals are that the mapping that constitutes the Metropolis proposal is its own inverse, and that it has Jacobian 1. That is:

Theorem (Deterministic Proposals for the Metropolis Algorithm). If Y = M(X) is a deterministic mapping that satisfies the two conditions

|∂Y/∂X| = 1   (volume conservation)

M(M(X)) = X   (reversibility)

then by accepting the update X → M(X) with probability min[1, π(M(X))/π(X)] and rejecting it otherwise, the distribution π(X) is left invariant.

We give the proof of this in Appendix B.

To satisfy the two conditions of volume conservation and reversibility, we use a deterministic proposal composed of leapfrog updates to simulate the Hamiltonian dynamics, performing a trajectory of l steps, each lasting time ε. At the end of each trajectory, we negate the momentum p. Each leapfrog update consists of

p_i(t + ε/2) = p_i(t) − (ε/2) ∂E/∂q_i(q(t))              for each i = 1, …, d

q_i(t + ε) = q_i(t) + ε p_i(t + ε/2) / m_i               for each i = 1, …, d

p_i(t + ε) = p_i(t + ε/2) − (ε/2) ∂E/∂q_i(q(t + ε))      for each i = 1, …, d

Note that in the above scheme, all the components are updated before moving on to the next line. For instance, all the components of p(t + ε/2) are calculated before the update for q is computed.
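The half-step/full-step/half-step structure above can be sketched in Python (a minimal sketch; the function name and the one-dimensional quadratic example energy are our own, not from the thesis):

```python
import numpy as np

def leapfrog_step(q, p, grad_E, eps, m):
    """One leapfrog update: half-step in p, full step in q, half-step in p."""
    p = p - 0.5 * eps * grad_E(q)   # p(t + eps/2)
    q = q + eps * p / m             # q(t + eps)
    p = p - 0.5 * eps * grad_E(q)   # p(t + eps)
    return q, p

# Example: E(q) = q^2/2, so grad_E(q) = q (a one-dimensional harmonic oscillator).
grad_E = lambda q: q
q, p = np.array([1.0]), np.array([0.0])
for _ in range(1000):
    q, p = leapfrog_step(q, p, grad_E, eps=0.1, m=np.array([1.0]))
# H = E + K stays close to its initial value 0.5 over the whole trajectory.
H = 0.5 * q**2 + 0.5 * p**2
```

Even over many steps, H does not drift; it only oscillates within a band of width of order ε², which is what makes long trajectories usable as Metropolis proposals.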

To see that the leapfrog update satisfies the volume conservation condition, we note that the change in each component of each state variable does not depend on itself, so each component's update amounts to a shear, which is a volume-preserving transformation and which therefore has Jacobian 1. This is discussed in greater detail in Appendix C.

Also, it is easy to check that the negation of the momentum at the end of a leapfrog trajectory of multiple leapfrog steps means that if a trajectory takes us from point A to point B, then starting at B takes us back to A. Thus leapfrog trajectories also satisfy the reversibility condition.

We summarize hybrid Monte Carlo in the two algorithms below. Note that the momentum negation is not implemented, as the momentum is replaced by resampling before the next leapfrog trajectory anyway.

The algorithm discussed thus far avoids random walks by allowing long trajectory lengths. However, the actual algorithm implemented by Neal has an additional optimization of the stepsizes, which estimates the local second derivatives of the potential energy in order to take steps that are appropriately scaled in the various dimensions. This is discussed next.

Stepsize Selection

In using the hybrid Monte Carlo method, the question of what values to choose for the stepsize ε and for the masses m_i naturally arises. As we shall see, choosing the masses is equivalent to choosing different stepsizes in different dimensions of the state variable, and the careful choice of these stepsizes is necessary for hybrid Monte Carlo to perform well.

Since ε is the timestep of a discretized simulation of Hamiltonian dynamics, it is clear from the leapfrog equations that large values of ε cause an inaccurate simulation, so that H can wander far from its initial value. In particular, such an inaccurate simulation will typically land the proposed state in a region of low target probability. This is analogous to using a Gaussian proposal with too large a variance in the Metropolis algorithm example discussed earlier, and so having to reject frequently. Thus keeping rejection rates low requires careful selection of an ε that is low enough, and yet not so low that we explore the distribution laboriously.

Algorithm: HybridMonteCarlo(nsamples, l, ε, q_init)

  q^(0) ← q_init
  for each component j do
    p_j^(0) ~ N(mean 0, variance m_j)
  end for
  for i = 1 to nsamples do
    for each component j do
      p_j^(i−1) ~ N(mean 0, variance m_j)
    end for
    p^(i) ← p^(i−1)
    q^(i) ← q^(i−1)
    for j = 1 to l do
      (q^(i), p^(i)) ← LeapfrogUpdate(q^(i), p^(i), ε)
    end for
    if U(0,1) ≥ min{1, exp[−H(q^(i), p^(i)) + H(q^(i−1), p^(i−1))]} then
      q^(i) ← q^(i−1)      {reject: restore the previous state}
      p^(i) ← p^(i−1)
    end if
  end for
  Return {q^(i), p^(i)}_{i=1}^{nsamples}

Algorithm: LeapfrogUpdate(q, p, ε)

  for each component i do
    p_i ← p_i − (ε/2) ∂E(q)/∂q_i
  end for
  for each component i do
    q_i ← q_i + ε p_i / m_i
  end for
  for each component i do
    p_i ← p_i − (ε/2) ∂E(q)/∂q_i
  end for
  Return (q, p)
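Putting the two algorithms together, a complete sampler can be sketched in Python (a minimal sketch; the function names and the two-dimensional Gaussian target are our own illustration, not the thesis implementation):

```python
import numpy as np

def hmc(E, grad_E, q0, n_samples, l, eps, m, rng):
    """Hybrid Monte Carlo: resample momenta, run a leapfrog trajectory of
    l steps, then apply a Metropolis test on the change in H = E + K."""
    d = len(q0)
    q = q0.copy()
    samples = []
    for _ in range(n_samples):
        p = rng.normal(0.0, np.sqrt(m), size=d)   # momenta ~ N(0, m_j)
        q_new, p_new = q.copy(), p.copy()
        for _ in range(l):                        # leapfrog trajectory
            p_new = p_new - 0.5 * eps * grad_E(q_new)
            q_new = q_new + eps * p_new / m
            p_new = p_new - 0.5 * eps * grad_E(q_new)
        H_old = E(q) + np.sum(p**2 / (2 * m))
        H_new = E(q_new) + np.sum(p_new**2 / (2 * m))
        if rng.random() < np.exp(min(0.0, H_old - H_new)):  # accept
            q = q_new
        samples.append(q.copy())
    return np.array(samples)

# Example target: a standard 2-D Gaussian, i.e. E(q) = |q|^2 / 2.
rng = np.random.default_rng(0)
E = lambda q: 0.5 * np.sum(q**2)
grad_E = lambda q: q
samples = hmc(E, grad_E, np.zeros(2), 2000, l=20, eps=0.2, m=np.ones(2), rng=rng)
```

With l = 20 steps of size 0.2, each trajectory moves far across the target, so successive samples are only weakly correlated and the empirical moments match the standard Gaussian closely.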

As described by Neal, for a toy quadratic Hamiltonian of the form

H = q²/(2σ²) + p²/2,

H diverges under the leapfrog discretization if a stepsize ε > 2σ is used, whereas H stays bounded if ε < 2σ. To transfer this result to a general non-quadratic H(q, p), we note that near equilibrium, samples are usually obtained near local minima of E(q), where E can be approximated by its Taylor expansion to second order. Thus we expect a stepsize

ε < 2 (∂²E/∂q²)^(−1/2)

to be appropriate near equilibrium.

In the case where the state space is multidimensional, but the Taylor expansion to second order has no correlations between its dimensions, we could set

ε = min_i (∂²E/∂q_i²)^(−1/2)

in order to prevent the leapfrog simulation from diverging. But if the Taylor expansion has correlations between its dimensions, the above might not be small enough, as the stability of the leapfrog updates is constrained by the narrowest cross-section of the energy bowl, which might not be axis-aligned at all. In general, the stepsize has to be adjusted downwards by different amounts, depending on the shape and orientation of the energy function. To take this into account, we introduce an operator-defined tuning parameter η, which we call the stepsize adjustment factor, and which controls the stepsize as follows:

ε = η min_i (∂²E/∂q_i²)^(−1/2)

However, choosing the same stepsize to use in all directions may cause slow, random-walk-like exploration in some directions unless we use very long trajectories, which may be unnecessarily computationally intensive. The underlying problem is that of making a move that is compatible with the local length scales of the distribution. It is the same problem that we encountered earlier in considering the Metropolis algorithm with simple proposals, but under a slightly different guise.

Clearly, it is preferable to use a different, appropriate stepsize for each direction, the values of which we choose by looking at the local length scales of the distribution. Ideally, we would like to set the stepsize ε_i for direction i based on the width of the potential energy bowl in the direction q_i:

ε_i = η (∂²E/∂q_i²)^(−1/2)

However, one may wonder whether the leapfrog update with different stepsizes for different components still simulates Hamiltonian dynamics. The answer is that the masses are the extra degrees of freedom that enable us to implement different stepsizes in different directions and still keep H approximately constant. To see this, we first note that if we rewrite the leapfrog equations in terms of p̃_i = p_i/√m_i, they become

p̃_i(t + ε/2) = p̃_i(t) − (ε/(2√m_i)) ∂E/∂q_i(q(t))

q_i(t + ε) = q_i(t) + (ε/√m_i) p̃_i(t + ε/2)

p̃_i(t + ε) = p̃_i(t + ε/2) − (ε/(2√m_i)) ∂E/∂q_i(q(t + ε))

We set the stepsizes ε_i = ε/√m_i and rewrite the leapfrog updates as

p̃_i(t + ε/2) = p̃_i(t) − (ε_i/2) ∂E/∂q_i(q(t))

q_i(t + ε) = q_i(t) + ε_i p̃_i(t + ε/2)

p̃_i(t + ε) = p̃_i(t + ε/2) − (ε_i/2) ∂E/∂q_i(q(t + ε))

These new updates are exactly equivalent to the original leapfrog updates, except that we work in terms of rescaled momenta. Should we choose to, we can always recover the old momenta after an update. But rather than using the original leapfrog updates, we can work in terms of p̃_i, using the new mass-absorbed updates. We note the following important facts about one mass-absorbed update of q and p̃:

Fact 1. Each mass-absorbed leapfrog update keeps H̃(q, p̃) = E(q) + (1/2) Σ_{i=1}^d p̃_i² approximately invariant. This is because it keeps H(q, p) = E(q) + Σ_{i=1}^d p_i²/(2 m_i) approximately invariant, and the two H's are equal.

Fact 2. Each mass-absorbed leapfrog update conserves phase space volume in the state space (q, p̃), since p̃_i is related to p_i merely by the scale factor 1/√m_i. This holds if the ε_i's are set independently of the current values of q or p̃.

Fact 3. Each mass-absorbed leapfrog update is reversible, so long as the ε_i's are set without using the current values of q and p̃, since these are different at the beginning and at the end of each step.
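The equivalence underlying these facts is easy to check numerically: one ordinary leapfrog step with masses m_i and one mass-absorbed step with ε_i = ε/√m_i produce the same q, and momenta related exactly by p̃_i = p_i/√m_i (the quadratic example energy and variable names below are our own):

```python
import numpy as np

grad_E = lambda q: q           # E(q) = |q|^2 / 2, for illustration
m = np.array([1.0, 4.0, 9.0])  # masses
eps = 0.1

# Ordinary leapfrog step in (q, p).
q = np.array([1.0, -0.5, 2.0]); p = np.array([0.3, 0.2, -0.1])
p1 = p - 0.5 * eps * grad_E(q)
q1 = q + eps * p1 / m
p1 = p1 - 0.5 * eps * grad_E(q1)

# Mass-absorbed step in (q, p~) with p~_i = p_i / sqrt(m_i), eps_i = eps / sqrt(m_i).
eps_i = eps / np.sqrt(m)
pt = p / np.sqrt(m)
pt1 = pt - 0.5 * eps_i * grad_E(q)
q2 = q + eps_i * pt1
pt1 = pt1 - 0.5 * eps_i * grad_E(q2)
# q2 equals q1, and pt1 equals p1 / sqrt(m): the two formulations agree.
```

Because the correspondence is an exact change of variables, all three facts about the mass-absorbed form follow directly from the corresponding properties of the ordinary leapfrog step.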

In view of these three facts, we have the following revised algorithm, which stores p̃ instead of p. Before the leapfrog updates, we estimate the stepsizes

ε_i = η (∂²E/∂q_i²)^(−1/2)

independently of the current state, using some problem-dependent heuristic. The leapfrog trajectory now consists of mass-absorbed updates, at the end of which we apply the Metropolis rejection test using the Hamiltonian H̃(q, p̃) = E(q) + (1/2) Σ_{i=1}^d p̃_i². Since the proposals in the (q, p̃) state space are reversible and conserve phase space volume, the Metropolis rejection test can be used to produce an update that keeps exp(−H̃(q, p̃)) invariant. Thus we have a Markov chain that leaves exp(−E(q) − (1/2) Σ_{i=1}^d p̃_i²) invariant.

We summarize hybrid Monte Carlo with stepsize selection in the algorithm below, where the function Stepsizes computes the appropriate stepsize to use for each component. We call the function LeapfrogUpdate with a vector for its stepsize parameter, unlike the scalar in the earlier algorithm, but what we mean here should be clear. The setting of the stepsizes depends on the exact problem at hand. For a neural network model, Neal sets them based on the current values of the hyperparameters. These do not change over the course of a leapfrog trajectory in the scheme presented in his book, so that leapfrog trajectories are reversible.

Convergence of Hybrid Monte Carlo

In this section, we discuss the conditions under which this Markov chain algorithm converges to a unique invariant distribution.

The convergence theorem cited earlier tells us that in order for hybrid Monte Carlo to converge to a unique distribution, it must be both irreducible and aperiodic. Whether or not this is true depends on the details of the Hamiltonian, the leapfrog trajectory length, and the stepsize adjustment factor. Although we have no formal proof, we have strong reasons to believe that both conditions are satisfied in most neural network applications, whose Hamiltonian dynamics are highly nonlinear and whose Hamiltonians have values that are finite for finite values of the state parameters.

Let us first discuss periodicity. For most problems involving complex nonlinear Hamiltonians, such as the ones we encounter in neural network applications, we expect that periodicity is unlikely and, if it should appear, is pathological rather than typical. An example of such an unlikely periodicity is a case where the Hamiltonian dynamics takes us exactly halfway or completely around a hypershell of constant H.

Algorithm: HybridMonteCarloWithStepsizes(nsamples, l, η, q_init)

  q^(0) ← q_init
  p̃^(0) ~ N(mean 0, variance I)
  for i = 1 to nsamples do
    p̃^(i−1) ~ N(mean 0, variance I)
    ε ← Stepsizes(η)
    p̃^(i) ← p̃^(i−1)
    q^(i) ← q^(i−1)
    for j = 1 to l do
      (q^(i), p̃^(i)) ← LeapfrogUpdate(q^(i), p̃^(i), ε)
    end for
    if U(0,1) ≥ min{1, exp[−H̃(q^(i), p̃^(i)) + H̃(q^(i−1), p̃^(i−1))]} then
      q^(i) ← q^(i−1)      {reject: restore the previous state}
      p̃^(i) ← p̃^(i−1)
    end if
  end for
  Return {q^(i), p̃^(i)}_{i=1}^{nsamples}

Such a periodicity would persist in hypershells of all values of H, so that momentum resampling does not avail us of an escape from it. This situation appears very unlikely for the highly nonlinear neural network Hamiltonians that we use, so we expect that we will probably always have aperiodicity in practice. Still, if one wishes to be on the safe side, one can vary the stepsize adjustment factors randomly over a small range, and that should remove any periodicities (Mackenzie). This modification was not implemented in our version of the algorithm.

Irreducibility depends on the exact shape of the potential energy surface. Let us assume for now that our hybrid Monte Carlo algorithm is able to simulate Hamiltonian dynamics perfectly. Then it seems intuitively clear that, so long as the potential energy does not become infinite for finite values of q, hybrid Monte Carlo should be irreducible. In particular, while moving around in a local minimum, the chain always has some probability of gaining a sufficiently large kinetic energy from momentum replacement to leave it and visit other parts of state space. This can only be prevented if that local minimum is bounded by walls of infinite potential energy. And since our neural network Hamiltonian is always finite for finite values of q, we expect that we will always have irreducibility. However, hybrid Monte Carlo really only simulates Hamiltonian dynamics approximately, so it is conceivable that a finite potential well could be a trap like an infinite one. Nevertheless, the fact that hybrid Monte Carlo does simulate Hamiltonian dynamics is reason to believe that the above argument for irreducibility should usually apply.

Chapter

Hyperparameter Updates Using Gibbs Sampling

Neural Network Architecture

In this thesis, we concern ourselves with a neural network with one hidden layer only. The techniques described here can readily be extended to networks with more hidden layers.

The neural network model used here has N_x input units, N_h hidden units, and N_y output units. The parameters of the neural network are referred to collectively as θ. As shown in the figure below, they are:

U = {u_i}_{i=1}^{N_u}   Input-to-hidden weights

V = {v_i}_{i=1}^{N_v}   Hidden-to-output weights

A = {a_i}_{i=1}^{N_a}   Hidden biases

B = {b_i}_{i=1}^{N_b}   Output biases

where N_u = N_x N_h, N_v = N_h N_y, N_a = N_h, and N_b = N_y.

We will sometimes use the alternative notation:

[Figure: Neural network architecture used in this work. N_x input units connect to N_h hidden units (biases a_j) through the weights u; the hidden units connect to N_y output units (biases b_k) through the weights v.]

U_ij   Input-to-hidden weight from input unit i to hidden unit j

V_jk   Hidden-to-output weight from hidden unit j to output unit k

A_j    Bias on hidden unit j

B_k    Bias on output unit k

In this neural network, we use the tanh nonlinearity, and it occurs only at the single hidden layer. Thus, for input vector {x_i}_{i=1}^{N_x}, the j-th hidden unit output is

h_j = tanh( Σ_{i=1}^{N_x} U_ij x_i + A_j ),

while the output at unit k has no nonlinearity:

f_k = Σ_{j=1}^{N_h} V_jk h_j + B_k.
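The two equations above translate directly into code (a minimal sketch; the function name and array layout, with U of shape (N_x, N_h) and V of shape (N_h, N_y), are our own conventions):

```python
import numpy as np

def forward(x, U, A, V, B):
    """Forward pass of the one-hidden-layer network:
    h_j = tanh(sum_i U_ij x_i + A_j),  f_k = sum_j V_jk h_j + B_k."""
    h = np.tanh(x @ U + A)   # x: (N_x,), U: (N_x, N_h), A: (N_h,)
    return h @ V + B         # V: (N_h, N_y), B: (N_y,)

# Tiny example with N_x = 2, N_h = 3, N_y = 1: with all weights zero,
# the hidden units output tanh(0) = 0 and the output is just its bias.
U = np.zeros((2, 3)); A = np.zeros(3)
V = np.zeros((3, 1)); B = np.array([0.7])
f = forward(np.array([1.0, -1.0]), U, A, V, B)
```

Because the output layer is linear, f is an unbounded real vector, which matches the Gaussian noise model introduced in the next section.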

Neural Network Model of the Data

The neural network model of the data is as follows. Let ε be the error in the output of the neural network for training input x and target y:

ε = f(x) − y.

We assume that each component of ε has precision (inverse variance) τ. Then, given an input case x, the parameters θ of the network, and τ, the probability of observing the target y is

P(y | x, θ, τ) = (τ/2π)^{N_y/2} exp( −τ |f(x) − y|² / 2 ).

The prior distributions of the four groups of parameters are normal, with means 0 and precisions τ_*. The asterisk indicates the corresponding group of network parameters, and may be u, v, a or b. For instance, for the input-to-hidden weights U,

P(U | τ_u) = (τ_u/2π)^{N_u/2} exp( −(τ_u/2) Σ_{i=1}^{N_u} u_i² ).

The prior distribution of the parameters θ is simply the joint distribution

P(θ | γ) = P(U | τ_u) P(V | τ_v) P(A | τ_a) P(B | τ_b),

where γ denotes {τ_u, τ_v, τ_a, τ_b, τ}, the set of hyperparameters; the first four control the prior distribution of each group of parameters. These prior distributions keep the network parameters small, and amount to a principled formulation of the weight-decay terms found in the neural network training literature (see Bishop).

Rather than fixing the hyperparameters, we allow them to vary also, and we let each have a gamma distribution. For instance, for τ_u:

P(τ_u) = (α_u/2ω_u)^{α_u/2} / Γ(α_u/2) · τ_u^{α_u/2 − 1} exp( −τ_u α_u / 2ω_u ),

which is a gamma distribution with mean ω_* and shape parameter α_* for each τ_*. The prior distribution of the hyperparameters, P(γ), is simply the joint distribution

P(γ) = P(τ_u) P(τ_v) P(τ_a) P(τ_b) P(τ).

In this work, the ω_*'s and α_*'s are fixed by hand.

Posterior Distributions of the Parameters and the Hyperparameters

We can infer the posterior distributions for the parameters θ = {U, V, A, B} and the hyperparameters γ = {τ_u, τ_v, τ_a, τ_b, τ} when the training data is observed. In the Monte Carlo approach, we do this by obtaining samples from the distribution P(θ, γ | x, y), where x = {x^c}_{c=1}^{N_c} and y = {y^c}_{c=1}^{N_c} are the training data.

Neal samples from P(θ, γ | x, y) by obtaining a series of Markov chain samples {θ_i, γ_i}_{i=1}^{N_s}, where γ_i is obtained from P(γ | θ_{i−1}, x, y) by Gibbs sampling, and θ_i is obtained from θ_{i−1} by a hybrid Monte Carlo update that leaves the distribution P(θ | γ_i, x, y) invariant. During the first iteration, the hyperparameters are set to some moderate values. We discuss the convergence properties of this Markov chain later in this chapter.

In the remainder of this section, we give the distributions P(θ | γ, x, y) and P(γ | θ, x, y), which are required for the above sampling scheme.

Using Bayes' Rule, we obtain the posterior distribution for θ as

P(θ | γ, x, y) = P(y | θ, γ, x) P(θ | γ, x) / P(y | γ, x)
∝ P(y | θ, γ, x) P(θ | γ)
= [ Π_{c=1}^{N_c} P(y^c | x^c, θ, τ) ] P(U | τ_u) P(V | τ_v) P(A | τ_a) P(B | τ_b).

The factors on the right-hand side are given in full by the likelihood and prior expressions of the previous section.

The posterior distribution P(γ | θ, x, y) is the probability of the hyperparameters conditioned on θ, x and y:

P(γ | θ, x, y) = P(τ_u | θ, x, y) P(τ_v | θ, x, y) P(τ_a | θ, x, y) P(τ_b | θ, x, y) P(τ | θ, x, y).

Consider the hyperparameter τ_u. Since each τ_* is the precision of its group of parameters, it can be inferred solely from those parameters, independently of x and y. For instance, for τ_u:

P(τ_u | θ, x, y) = P(τ_u | {u_i}_{i=1}^{N_u})
∝ P({u_i}_{i=1}^{N_u} | τ_u) P(τ_u)
∝ τ_u^{(α_u + N_u)/2 − 1} exp[ −(τ_u/2)( α_u/ω_u + Σ_{i=1}^{N_u} u_i² ) ].

τ, on the other hand, is the precision of the noise in each component of y^c, and is inferred from the errors ε^c = f(x^c) − y^c:

P(τ | θ, x, y) = P(τ | {ε^c}_{c=1}^{N_c}) ∝ P({y^c}_{c=1}^{N_c} | τ, θ, x) P(τ)
∝ τ^{(α + N_c N_y)/2 − 1} exp[ −(τ/2)( α/ω + Σ_{c=1}^{N_c} |f(x^c) − y^c|² ) ].

Note that both of these posterior distributions are again gamma distributions.
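A Gibbs update for one precision under the gamma-posterior form given above (our reconstruction) can be sketched as follows; note that NumPy's gamma sampler takes a shape and a scale, where scale = 1/rate:

```python
import numpy as np

def gibbs_precision(values, alpha, omega, rng):
    """Draw a precision tau from its gamma posterior given its group of
    zero-mean normal 'values': shape (alpha + n)/2, rate (alpha/omega + sum v^2)/2."""
    n = values.size
    shape = 0.5 * (alpha + n)
    rate = 0.5 * (alpha / omega + np.sum(values**2))
    return rng.gamma(shape, 1.0 / rate)   # numpy uses scale = 1/rate

rng = np.random.default_rng(0)
u = rng.normal(0.0, 0.5, size=5000)       # weights with true precision 1/0.25 = 4
taus = np.array([gibbs_precision(u, alpha=1.0, omega=1.0, rng=rng)
                 for _ in range(200)])
# With many weights, the posterior draws concentrate near the true precision 4.
```

This is the hyperparameter half of the alternating scheme: given the current weights, each precision is refreshed by one exact draw from its conditional distribution.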

Sampling From the Posterior Distributions of the Parameters and the Hyperparameters

Since the conditional distributions above are gamma distributions, independent samples for the hyperparameters can be drawn using well-known techniques (Devroye).

Drawing samples from the posterior distribution of the parameters is more difficult. For instance, standard Gibbs sampling cannot be used, because the conditional distribution of each network parameter can be a very complicated function, owing to the training error terms. A simple Metropolis method suffers from random walks, as previously described. So instead, Neal obtains samples of the network parameters using the hybrid Monte Carlo technique with stepsize selection, as described in the previous chapter. To sample from the posterior distribution for θ, the potential energy is set as follows:

E(θ) = −log P(θ | γ, x, y)
= (τ/2) Σ_{c=1}^{N_c} |f(x^c) − y^c|² + (τ_u/2) Σ_{i=1}^{N_u} u_i² + (τ_v/2) Σ_{i=1}^{N_v} v_i² + (τ_a/2) Σ_{i=1}^{N_a} a_i² + (τ_b/2) Σ_{i=1}^{N_b} b_i² + const,

where the constant is immaterial because it does not affect the dynamics of the system that gives rise to the desired probability distribution. This constant corresponds to prefactors in the distribution P(θ | γ, x, y) that do not depend on θ. The parameters θ correspond to q in the hybrid Monte Carlo algorithm.
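This potential energy is straightforward to compute from the network's forward pass (a minimal sketch, with the constant dropped; the function name and array shapes are our own conventions):

```python
import numpy as np

def potential_energy(params, taus, X, Y):
    """E = (tau/2) sum_c |f(x^c) - y^c|^2 plus (tau_*/2) sum w^2 for each
    weight group, dropping the additive constant.
    params = (U, A, V, B); taus = (tau, tau_u, tau_v, tau_a, tau_b)."""
    U, A, V, B = params
    tau, tau_u, tau_v, tau_a, tau_b = taus
    F = np.tanh(X @ U + A) @ V + B                  # outputs, one row per case
    E = 0.5 * tau * np.sum((F - Y)**2)              # likelihood (error) term
    E += 0.5 * (tau_u * np.sum(U**2) + tau_v * np.sum(V**2)
                + tau_a * np.sum(A**2) + tau_b * np.sum(B**2))  # prior terms
    return E

# Sanity check: zero weights and zero targets give zero energy.
U = np.zeros((2, 3)); A = np.zeros(3); V = np.zeros((3, 1)); B = np.zeros(1)
X = np.ones((4, 2)); Y = np.zeros((4, 1))
E = potential_energy((U, A, V, B), (1.0, 1.0, 1.0, 1.0, 1.0), X, Y)
```

The gradient of this E with respect to the weights, obtained by backpropagation, is what the leapfrog updates of the previous chapter consume.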

Convergence of the Algorithm

In this section, we discuss the conditions under which this Markov chain algorithm converges to a unique invariant distribution.

Recall that our algorithm alternately updates the hyperparameters by Gibbs sampling and the parameters using hybrid Monte Carlo. We have already discussed the reasons why hybrid Monte Carlo by itself should converge to a unique distribution. The question is: combined with the hyperparameter Gibbs sampling update step, does the resulting Markov chain converge?

A Markov chain update is periodic so long as it is periodic in one of the parameters of its state space, so the fact that Gibbs sampling of the hyperparameters has nonzero probability of producing any value does not immediately imply aperiodicity. However, when the hyperparameters are updated from one iteration to the next, hybrid Monte Carlo sees a random modification of the potential energy surface that, if anything, would prevent systematic behaviour like periodicity. Thus the Gibbs sampling step renders periodicity even more unlikely than ever.

When the hyperparameters change from one iteration to the next, hybrid Monte Carlo sees a modification of the potential energy surface that does not introduce any infinite barriers, so the earlier argument for irreducibility still applies, and we expect that, with sufficient iterations, all regions of the network parameter state space can be visited, regardless of the values the hyperparameters have. Thus we expect that all regions of the joint state space containing both parameters and hyperparameters can be visited after a sufficient number of iterations, and our algorithm should be irreducible.

Although we have no formal proof, based on the above arguments we expect that our Markov chain does converge to a unique distribution.

Inefficiency Due to Gibbs Sampling of Hyperparameters

Although we avoid random walks in network parameter space given the hyperparameters, Gibbs sampling of the hyperparameters can lead to a slow random walk in the joint state space of the hyperparameters and the network parameters. This is because the distribution of the parameters conditional on the hyperparameters is restricted by that conditioning; this restricts the possible values the parameters are likely to visit, and so the distribution of the hyperparameters conditional on the parameters is unlikely to change much from the previous iteration, in order to be consistent with the parameters.

This is especially apparent when there are many hidden units. Consider sampling the input-to-hidden weights u_i given τ_u. When there are many of them, they represent their distribution well, with a variance close to 1/τ_u. Thus when τ_u is Gibbs-sampled, we are likely to obtain a value close to τ_u again, and so the Markov chain becomes highly correlated from sample to sample.
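The pinning effect is easy to see numerically: repeat one Gibbs round-trip (weights drawn at a fixed precision, then a fresh precision drawn from its gamma posterior) and measure how tightly the new precision clusters around the old value as the group grows (the setup and prior values below are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_spread(n_weights, tau_old=4.0, n_iter=500):
    """One Gibbs round-trip: draw weights ~ N(0, 1/tau_old), then draw
    tau from its gamma posterior; return the relative spread of tau."""
    draws = []
    for _ in range(n_iter):
        u = rng.normal(0.0, 1.0 / np.sqrt(tau_old), size=n_weights)
        shape = 0.5 * (1.0 + n_weights)            # vague gamma prior
        rate = 0.5 * (1.0 + np.sum(u**2))
        draws.append(rng.gamma(shape, 1.0 / rate))
    draws = np.array(draws)
    return draws.std() / draws.mean()

small, large = relative_spread(10), relative_spread(1000)
# With 100x more weights, each new precision is pinned far more tightly
# around the old one, so the chain moves in much smaller hyperparameter steps.
```

The relative spread shrinks roughly as 1/√N_u, which is exactly the slow random walk described above: the more hidden units, the smaller each hyperparameter move.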

This problem can be alleviated when there are many training cases. In the potential energy above, the prior terms number N_u + N_v + N_a + N_b in total. When N_c N_y ≫ N_u + N_v + N_a + N_b, the likelihood term has a stronger effect in determining the shape of the potential energy bowl being sampled from, and so the weights move mostly to fit the data rather than to satisfy the prior constraints imposed by their precisions; in this case, the hyperparameters become slaved to the weights, which in turn are well determined by the data.

When we do not have the option of obtaining more training cases and we use large numbers of hidden units, the random walk described above becomes an issue, and may slow the method down so much as to render it impractical to use. The next chapter introduces the solution considered in this thesis: updating the hyperparameters using hybrid Monte Carlo, as opposed to Gibbs sampling.

Chapter

Hyperparameter Updates Using Hamiltonian Dynamics

The New Scheme

To overcome the slow movement of the hyperparameters when using Gibbs sampling, we propose to update the hyperparameters using Hamiltonian dynamics, in the same way as the parameters.

The Idea

The problem with the old scheme is that the alternating updates of each hyperparameter and its associated parameters cause them to pin each other down, resulting in Markov chain moves that are small compared to the overall distribution, and which can double back, since Markov chains have only state memory and no momentum memory. This doubling back is similar to the random walk behaviour of the Metropolis algorithm with simple proposals. Since hybrid Monte Carlo is our way of overcoming that, we hope that hybrid Monte Carlo, by producing trajectories that can keep going in the same general direction for long distances, may also allow the hyperparameters to travel long distances in a single leapfrog trajectory without doubling back. This should lead to gains in how rapidly the parameters and the hyperparameters sample the posterior distribution.

The New Scheme in Detail

Rather than updating the parameters conditional on the hyperparameters and vice versa, our aim is now to update the parameters θ and the hyperparameters γ jointly, according to the joint posterior distribution

P(θ, γ | x, y) = P(γ) P(θ | γ) P(y | θ, τ, x) / P(y | x)
∝ P(γ) P(θ | γ) P(y | θ, τ, x),

where we have dropped the normalizing constant P(y | x) and used the fact that, given θ, the dependence of y on the hyperparameters is restricted to just the noise hyperparameter τ.

From the expressions for these factors given in the previous chapter, we get

E(θ, γ) = −log P(θ, γ | x, y)
= (τ/2)( α/ω + Σ_{c=1}^{N_c} |f(x^c) − y^c|² ) − ( (N + α)/2 − 1 ) log τ + Σ_{*∈{u,v,a,b}} E_* + const,

where N = N_c N_y is the total number of target values in the training data, and

E_u = (τ_u/2)( α_u/ω_u + Σ_{i=1}^{N_u} u_i² ) − ( (N_u + α_u)/2 − 1 ) log τ_u,

with E_v, E_a and E_b similarly defined.

The above is the new potential energy that hybrid Monte Carlo must use in its simulation. Accordingly, we now expand the state space to include the hyperparameters. We denote the position and momentum variables corresponding to the parameters and to the hyperparameters with the subscripts θ and γ respectively:

q = (q_θ, q_γ)

p = (p_θ, p_γ)

However, some complications now arise, owing to the necessity for the leapfrog proposals to be symmetric. The main problem is that the hyperparameters at the beginning and at the end of a leapfrog trajectory are now different, so setting the parameter stepsizes based on the hyperparameters at the beginning does not lead to reversible dynamics; i.e., by reversing the momentum and following the leapfrog dynamics backwards from the finishing point, one does not arrive back at the starting point.

The solution is to first update just the parameters by one step, using stepsizes based on the current value of the hyperparameters. This is the same dynamical update with stepsize selection as before, and, as shown in the previous chapter, it is reversible, since the hyperparameters (and hence the stepsizes) have not changed over the step. Then we update the hyperparameters by one step, using stepsizes based only on the newly computed value of the parameters. This is also reversible, as the parameters do not change over the step. These two updates comprise one step in the new leapfrog trajectory, updating both parameters and hyperparameters. By repeating this l times, we obtain a leapfrog trajectory of length l that, being reversible in each step, is fully reversible end to end.

Each such step leaves H̃(q, p̃) = E(q) + (1/2) Σ_i p̃_i² approximately constant, and so H̃ is left approximately constant over the entire trajectory for small enough stepsizes. Also, phase space volume is conserved by each step, and so it is conserved over the entire trajectory. Thus we have a trajectory that keeps H̃ roughly constant and is a valid Metropolis proposal, owing to reversibility and phase space volume conservation.

There is actually a slight complication: if we update first the parameters and then the hyperparameters, the reversed trajectory is the one that updates first the hyperparameters and then the parameters, which is not actually the one we are using. To overcome this problem, at the beginning of a trajectory we choose, with equal probability, to update either the parameters or the hyperparameters first. Thus a trajectory that goes from point A to point B is proposed with probability 1/2, while one that goes in reverse from B to A is also proposed with probability 1/2, and so we have symmetric proposals. We summarize this in the algorithm below. There is a different stepsize for each component of q, including the expanded portion of the state; we denote the set of stepsizes for the parameters by ε_θ and the set of stepsizes for the hyperparameters by ε_γ.

Algorithm: Leapfrog trajectory that updates both parameters and hyperparameters using Hamiltonian dynamics

  r ~ U(0, 1)
  if r < 1/2 then
    for i = 1 to l do
      ε_θ ← ParamStepsizes(q_γ)
      (q_θ, p_θ) ← LeapfrogUpdate(q_θ, p_θ, q_γ, ε_θ)
      ε_γ ← HyperparamStepsizes(q_θ)
      (q_γ, p_γ) ← LeapfrogUpdate(q_γ, p_γ, q_θ, ε_γ)
    end for
  else
    for i = 1 to l do
      ε_γ ← HyperparamStepsizes(q_θ)
      (q_γ, p_γ) ← LeapfrogUpdate(q_γ, p_γ, q_θ, ε_γ)
      ε_θ ← ParamStepsizes(q_γ)
      (q_θ, p_θ) ← LeapfrogUpdate(q_θ, p_θ, q_γ, ε_θ)
    end for
  end if

An alternative way of achieving reversibility is to always both start and end a leapfrog trajectory with either a parameter update or a hyperparameter update. The exact method used should not significantly affect the performance of the algorithm.
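The structure of the alternating trajectory can be sketched as follows (a minimal sketch: the function names are our own, the stepsizes are fixed scalars rather than state-dependent vectors, and the toy energy is separable, whereas the real posterior couples the two blocks):

```python
import numpy as np

def leapfrog_block(q, p, grad, eps):
    """One leapfrog step for one block of variables, the other block held fixed."""
    p = p - 0.5 * eps * grad(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad(q)
    return q, p

def alternating_trajectory(q1, p1, q2, p2, grad1, grad2, eps1, eps2, l, rng):
    """l joint steps over (parameters, hyperparameters); a fair coin decides
    which block moves first, so a trajectory from A to B and its reverse
    from B to A are proposed with equal probability."""
    params_first = rng.random() < 0.5
    for _ in range(l):
        if params_first:
            q1, p1 = leapfrog_block(q1, p1, lambda q: grad1(q, q2), eps1)
            q2, p2 = leapfrog_block(q2, p2, lambda q: grad2(q1, q), eps2)
        else:
            q2, p2 = leapfrog_block(q2, p2, lambda q: grad2(q1, q), eps2)
            q1, p1 = leapfrog_block(q1, p1, lambda q: grad1(q, q2), eps1)
    return q1, p1, q2, p2

# Toy joint energy E = (q1^2 + q2^2)/2, so each block sees a unit quadratic bowl.
rng = np.random.default_rng(0)
q1, p1, q2, p2 = alternating_trajectory(
    np.array([1.0]), np.array([0.0]), np.array([0.5]), np.array([0.0]),
    grad1=lambda q, q2: q, grad2=lambda q1, q: q,
    eps1=0.05, eps2=0.05, l=100, rng=rng)
H = 0.5 * float(q1**2 + p1**2 + q2**2 + p2**2)   # initially 0.625
```

Each block update holds the other block fixed, so each is individually reversible and volume-preserving, and the joint Hamiltonian stays nearly constant over the whole trajectory.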

Reparameterization of the Hyperparameters

Numerically, it is inconvenient to work with the hyperparameters as precisions, as negative values of precisions are invalid. Thus we reparameterize the hyperparameters, working with log precisions instead:

λ = log τ
λ_u = log τ_u
λ_v = log τ_v
λ_a = log τ_a
λ_b = log τ_b

We let Λ denote the set of reparameterized hyperparameters:

Λ = {λ, λ_u, λ_v, λ_a, λ_b}.

The posterior probability density is changed by this reparameterization. The new density is obtained by multiplying by the appropriate Jacobian of each variable transformation in turn:

P(θ, Λ | x, y) = P(θ, γ | x, y) · τ · Π_{*∈{u,v,a,b}} τ_*
= P(θ, γ | x, y) exp( λ + Σ_{*∈{u,v,a,b}} λ_* ).

And so the new potential energy is

E(θ, Λ) = −log P(θ, Λ | x, y)
= E(θ, γ) − λ − Σ_{*∈{u,v,a,b}} λ_*
= (e^λ/2)( α/ω + Σ_{c=1}^{N_c} |f(x^c) − y^c|² ) − ((N + α)/2) λ + Σ_{*∈{u,v,a,b}} Ẽ_*,

where

Ẽ_u = (e^{λ_u}/2)( α_u/ω_u + Σ_{i=1}^{N_u} u_i² ) − ((N_u + α_u)/2) λ_u,

and Ẽ_v, Ẽ_a and Ẽ_b are similarly defined.

Chapter Hyperparameter Updates Using Hamiltonian Dynamics

Reparameterization of the Weights

The current parameterization scheme has a weakness that can b e seen by considering

the p otential energy as a function of the weights U and their hyp erparameter In

u

the absence of data the p otential energy dep ends on U and simply through E and

u

u

we see that it is shallow and broad for lowvalues of and narrow and deep for higher

u

values but not to o high This is b ecause for xed E is quadratic in u with width

u i

u

p p

p

u u

e e is the prior standard prop ortional to This makes sense as

u

deviation of u When there is data the landscap e will b e changed somewhat but the

i

tendencies imp osed by the priors will still b e there

The effect of this shape of the potential energy function is to make it unlikely for a sample that starts in the broad, shallow region to end up in the narrow, deep region. The reason is that, since $H$ is approximately conserved during a leapfrog trajectory, a particle that enters the narrow, deep region from the shallow region has enough energy to escape out to the shallow region again, and will indeed likely do so before we catch it in the deep region, since the latter has a comparatively small volume. Similarly, a particle that starts off in a narrow, deep region will likely not have enough total energy to escape, unless it acquired an unusually large amount of energy during momentum resampling.

This situation is suboptimal, as it increases autocorrelations. To move around more easily in state space, we introduce the following reparameterization of the weights:

$$\tilde u_i = u_i \sqrt{\tau_u} = u_i\, e^{\gamma_u/2}, \qquad \tilde v_i = v_i\, e^{\gamma_v/2}, \qquad \tilde a_i = a_i\, e^{\gamma_a/2}, \qquad \tilde b_i = b_i\, e^{\gamma_b/2}$$

and we continue to use $\gamma_\alpha = \log \tau_\alpha$, where $\alpha \in \{u, v, a, b\}$.

The notation of the previous chapter will continue to apply, except we will use a tilde to indicate a reparameterized parameter. For instance, $\tilde U$ will represent the group of reparameterized input-to-hidden weights $\{\tilde u_i\}_{i=1}^{N_u}$, and $\tilde U_{ij}$ will represent a reparameterized weight from input unit $i$ to hidden unit $j$. We will also use the following naming for all the reparameterized weights:

$$\tilde\theta = \{\tilde U, \tilde V, \tilde A, \tilde B\}$$

Due to the reparameterization, the posterior probability of the parameters and the hyperparameters must change accordingly. The complete reparameterization is obtained from the earlier expression as

$$P(\tilde\theta, \Gamma \mid x, y) = P(\theta, \Gamma \mid x, y) \prod_{i=1}^{N_u} \frac{\partial u_i}{\partial \tilde u_i} \prod_{i=1}^{N_v} \frac{\partial v_i}{\partial \tilde v_i} \prod_{i=1}^{N_a} \frac{\partial a_i}{\partial \tilde a_i} \prod_{i=1}^{N_b} \frac{\partial b_i}{\partial \tilde b_i} = P(\theta, \Gamma \mid x, y)\, \exp\Big(-\frac12 \sum_{\alpha \in \{u, v, a, b\}} N_\alpha \gamma_\alpha\Big)$$

And so, under the reparameterization, the potential energy becomes

$$\tilde E = -\log P(\tilde\theta, \Gamma \mid x, y) = E + \frac12 \sum_{\alpha \in \{u, v, a, b\}} N_\alpha \gamma_\alpha = \frac{e^{\gamma_\delta}}{2} \sum_{c=1}^{N_c} |f(x^c) - y^c|^2 - \frac{N_c}{2}\gamma_\delta + \sum_{\alpha \in \{u, v, a, b\}} \tilde E_{\gamma_\alpha}$$

where

$$\tilde E_{\gamma_u} = \frac12 \sum_{i=1}^{N_u} \tilde u_i^2 - \gamma_u + \cdots$$

(with the unwritten terms again coming from the gamma priors on the precisions), and $\tilde E_{\gamma_v}$, $\tilde E_{\gamma_a}$ and $\tilde E_{\gamma_b}$ are similarly defined.

It can be seen that the quadratic term $\tilde u_i^2$ in $\tilde E_{\gamma_u}$ now has a constant coefficient, so this reparameterization is effective at removing the variation, with its hyperparameter, of the width of $\tilde u_i$'s potential bowl.
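The effect of the weight reparameterization on the prior term can be illustrated with a small sketch (function names are illustrative; only the single-weight prior term is shown):

```python
import math

def prior_energy_original(u, gamma_u):
    """Prior part of the potential for one weight u: (e^gamma_u / 2) u^2.
    Its curvature in u depends on the hyperparameter gamma_u."""
    return 0.5 * math.exp(gamma_u) * u * u

def prior_energy_reparam(u_tilde):
    """After the substitution u_tilde = u * e^(gamma_u / 2), the same
    prior term becomes (1/2) u_tilde^2, with constant curvature."""
    return 0.5 * u_tilde * u_tilde

# The two parameterizations give the same energy value for the
# corresponding points:
u, gamma_u = 0.7, 3.0
u_tilde = u * math.exp(0.5 * gamma_u)
assert abs(prior_energy_original(u, gamma_u) - prior_energy_reparam(u_tilde)) < 1e-12
```

The constant curvature of the reparameterized bowl is what allows a single stepsize to work across all values of the hyperparameter.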

To distinguish between the new methods with and without the reparameterization of the weights, we will call the new method before the weight reparameterization the Dynamical A method, and the new method with the weight reparameterization the Dynamical B method.

First Derivatives of the Potential Energy

Each leapfrog step update requires the first derivatives of the Hamiltonian with respect to the parameters and the hyperparameters. To take steps that are appropriately scaled in the various directions, for stability and efficiency, we need the second derivatives as well, which we give below. Here we give the expressions for the first derivatives.

We first define

$$L^c = \frac{e^{\gamma_\delta}}{2} |f(x^c) - y^c|^2$$

which is the negative log likelihood of one training case, less the normalizing term. We then obtain:

$$\frac{\partial \tilde E}{\partial \gamma_\delta} = -\frac{N_c}{2} + \sum_{c=1}^{N_c} L^c$$

$$\frac{\partial \tilde E}{\partial \gamma_\alpha} = \frac{\partial \tilde E_{\gamma_\alpha}}{\partial \gamma_\alpha} + \sum_{c=1}^{N_c} \frac{\partial L^c}{\partial \gamma_\alpha}$$

where, as usual, $\alpha = u$, $v$, $a$ or $b$, and

$$\frac{\partial \tilde E}{\partial \tilde U_{ij}} = \tilde U_{ij} + \sum_{c=1}^{N_c} \frac{\partial L^c}{\partial \tilde U_{ij}}$$

Derivatives for the other weight types are obtained from the expression above by replacing $\tilde U_{ij}$ by the corresponding parameter.

To compute the derivative $\partial \tilde E/\partial \gamma_\delta$, we need to compute the network outputs for every training case. This involves performing a forward pass through

the net for each training case, each pass requiring compute time of order the number of connections in the network. The fact that we do a forward pass means that using backpropagation to compute other first derivatives will be efficient. We now explain how these other derivatives are computed.

First Derivatives with Respect to Parameters

To compute the derivatives with respect to the parameters, we need to find the corresponding derivatives of the output $L^c$. They are most efficiently computed using backpropagation, provided certain results are stored during a preceding forward pass, such as the one required by the above computation of $\partial \tilde E/\partial \gamma_\delta$.

The backpropagation works as follows. Consider the figure below, which represents two arbitrary adjacent layers in the network.

[Figure: Two adjacent layers. A source unit $i$ (superscript $S$), with pre-nonlinearity activation $g_i^S$ and output $h_i^S$, feeds a destination unit $j$ (superscript $D$) through the weight $\tilde V_{ij}\, e^{-\gamma/2}$.]

Here $\tilde V_{ij}\, e^{-\gamma/2}$ is the weight connecting a source unit $i$ to a destination unit $j$. We use $g_i$ to denote the total activation of unit $i$ before the tanh nonlinearity, and $h_i$ to denote the output of unit $i$ after the nonlinearity. Source unit values are denoted by superscript $S$, while destination unit values have superscript $D$. The total input into destination unit $j$ is

$$g_j^D = \sum_i \tilde V_{ij}\, e^{-\gamma/2}\, h_i^S + \tilde b_j\, e^{-\gamma'/2}$$

where $\gamma'$ is the hyperparameter controlling the biases $\tilde b_j$. Thus the first derivative of the potential energy with respect to $\tilde V_{ij}$ can be computed as follows:

$$\frac{\partial L^c}{\partial \tilde V_{ij}} = \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \tilde V_{ij}} = \frac{\partial L^c}{\partial g_j^D}\, e^{-\gamma/2}\, h_i^S$$

A similar expression holds for derivatives with respect to biases. In the above, if $j$ is actually an output unit, then $g_j^D = f_j$, so

$$\frac{\partial L^c}{\partial g_j^D} = e^{\gamma_\delta}\, \big(f_j(x^c) - y_j^c\big)$$

The outputs $f_j(x^c)$ can once again be considered to have already been obtained

free from the forward pass. As in standard backpropagation, we start with the above error derivatives at the output layer and propagate them backwards using

$$\frac{\partial L^c}{\partial g_i^S} = \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial h_i^S}\, \frac{\partial h_i^S}{\partial g_i^S} = \operatorname{sech}^2(g_i^S) \sum_j \frac{\partial L^c}{\partial g_j^D}\, \tilde V_{ij}\, e^{-\gamma/2}$$

If $g_i$ was stored for every unit during the forward pass, the above derivatives can be computed rapidly in a backward pass, taking time of order the number of connections in the network for each training case. Actually, we will see below that if we can save only one quantity, the most useful one is

$$\zeta_j = e^{-\gamma/2} \sum_i \tilde V_{ij}\, h_i^S$$

which is the total input going into unit $j$ from all the units feeding into it. From $\zeta_j$, $g_j$ can easily be obtained in constant time. Furthermore, it will be useful in other calculations that will be presented in the subsequent discussion.

Note that, thus far, we have seen how the computation of $\partial \tilde E/\partial \gamma_\delta$ and the first derivatives with respect to all the parameters is dominated by a forward pass and a backward pass through the network.
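As an illustration of the forward pass and first-derivative backpropagation just described, here is a minimal numpy sketch for a single tanh hidden layer with a linear output and reparameterized weights, checked against a finite difference (all names, sizes and hyperparameter values are illustrative assumptions, not the thesis code):

```python
import numpy as np

def forward(x, V_t, b_t, w_out, gamma, gamma_b):
    """Forward pass: hidden input g = (V~ e^{-gamma/2})^T x + b~ e^{-gamma_b/2},
    hidden output h = tanh(g), and a linear output f = w . h."""
    g = V_t.T @ x * np.exp(-0.5 * gamma) + b_t * np.exp(-0.5 * gamma_b)
    h = np.tanh(g)
    return g, h, w_out @ h

def grad_V(x, y, V_t, b_t, w_out, gamma, gamma_b, gamma_d):
    """Backpropagate dL/dV~ for L = (e^{gamma_d}/2)(f - y)^2."""
    g, h, f = forward(x, V_t, b_t, w_out, gamma, gamma_b)
    # dL/dg_j: output error times output weight times sech^2 = 1 - tanh^2.
    dL_dg = np.exp(gamma_d) * (f - y) * w_out * (1.0 - h**2)
    # dg_j/dV~_ij = e^{-gamma/2} x_i, so dL/dV~_ij = x_i dL/dg_j e^{-gamma/2}.
    return np.outer(x, dL_dg) * np.exp(-0.5 * gamma)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 0.5
V_t, b_t, w_out = rng.normal(size=(3, 4)), rng.normal(size=4), rng.normal(size=4)
gam, gam_b, gam_d = 0.3, -0.2, 0.1

# Finite-difference check of one weight's derivative.
G = grad_V(x, y, V_t, b_t, w_out, gam, gam_b, gam_d)
eps = 1e-6
Vp = V_t.copy(); Vp[1, 2] += eps
def loss(V):
    f = forward(x, V, b_t, w_out, gam, gam_b)[2]
    return 0.5 * np.exp(gam_d) * (f - y)**2
assert abs((loss(Vp) - loss(V_t)) / eps - G[1, 2]) < 1e-4
```

The finite-difference agreement is the standard sanity check for a backpropagation implementation of this kind.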

First Derivatives with Respect to the Hyperparameters

The computation of $\partial \tilde E/\partial \gamma_\delta$ has already been described. To compute $\partial \tilde E/\partial \gamma_\alpha$, we need to compute the corresponding derivatives of $L^c$. In a similar fashion as the weights computation, we can write, for the hyperparameter $\gamma$ that controls weights,

$$\frac{\partial L^c}{\partial \gamma} = \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \gamma} = -\frac12 \sum_j \frac{\partial L^c}{\partial g_j^D}\, e^{-\gamma/2} \sum_i \tilde V_{ij}\, h_i^S = -\frac12 \sum_j \frac{\partial L^c}{\partial g_j^D}\, \zeta_j$$

while for the bias hyperparameter $\gamma'$,

$$\frac{\partial L^c}{\partial \gamma'} = \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \gamma'} = -\frac12 \sum_j \frac{\partial L^c}{\partial g_j^D}\, \tilde b_j\, e^{-\gamma'/2}$$

We already computed the derivatives $\partial L^c/\partial g_j^D$ during the backward pass for the derivatives of the parameters. By ensuring that we save $\zeta_j$ during the forward pass, the first derivatives of $L^c$ with respect to the $u$, $v$, $a$ and $b$ hyperparameters can be efficiently computed in time of order the number of units in the network. Summed over all cases, the computational cost is $O(N_c (N_h + N_y))$.

Because the number of units is considerably smaller than the number of parameters, calculating these first derivatives with respect to the $u$, $v$, $a$ and $b$ hyperparameters is considered to add negligible cost to the forward and backward passes we have already done. Therefore, the computation of all the first derivatives is dominated by the forward and backward passes through the net, each of which takes $O(N_c |\theta|)$.

Approximations to the Second Derivatives of the Potential Energy

The second derivatives of the potential energy with respect to the weights and the hyperparameters are needed to compute stepsizes for each leapfrog step update. Unlike the first derivatives, we cannot compute the second derivatives exactly, because in order to preserve phase space conservation and reversibility of leapfrog steps, the stepsizes used for, say, hyperparameters cannot depend on the current values of the hyperparameters. We will also use additional simplifications to make evaluation easier and faster.

We see that the problem is to obtain, for the parameters,

$$\frac{\partial^2 \tilde E}{\partial \tilde u_i^2} = 1 + \sum_{c=1}^{N_c} \frac{\partial^2 L^c}{\partial \tilde u_i^2}$$

and similarly for $\tilde v_i$, $\tilde a_i$ and $\tilde b_i$; and, for the hyperparameters,

$$\frac{\partial^2 \tilde E}{\partial \gamma_\delta^2} = \sum_{c=1}^{N_c} L^c$$

and, for $\gamma$ taking the values $\gamma_u$, $\gamma_v$, $\gamma_a$ and $\gamma_b$,

$$\frac{\partial^2 \tilde E}{\partial \gamma^2} = \frac{\partial^2 \tilde E_{\gamma}}{\partial \gamma^2} + \sum_{c=1}^{N_c} \frac{\partial^2 L^c}{\partial \gamma^2}$$

Second Derivatives with Respect to the Parameters

To obtain the derivative $\partial^2 L^c/\partial \tilde u_i^2$, we follow the heuristic given by Neal (Appendix A), which we include here for completeness. The heuristic operates by approximately backpropagating the 2nd derivative of $L^c$ with respect to the output units back through the net. Its details are as follows.

Referring to the two-layer figure above, Neal uses the following approximation:

$$\frac{\partial^2 L^c}{\partial \tilde V_{ij}^2} \approx \frac{\partial^2 L^c}{\partial (g_j^D)^2} \left(\frac{\partial g_j^D}{\partial \tilde V_{ij}}\right)^2 \approx \begin{cases} \dfrac{\partial^2 L^c}{\partial (g_j^D)^2}\, e^{-\gamma}\, x_i^2 & \text{for } i \text{ an input unit} \\[2ex] \dfrac{\partial^2 L^c}{\partial (g_j^D)^2}\, e^{-\gamma} & \text{otherwise} \end{cases}$$

Correspondingly, for biases,

$$\frac{\partial^2 L^c}{\partial \tilde b_j^2} \approx \frac{\partial^2 L^c}{\partial (g_j^D)^2}\, e^{-\gamma'}$$

We see that it is necessary to compute the derivatives $\partial^2 L^c/\partial (g_j^D)^2$. We will do so in a way analogous to the backpropagation of the first derivative $\partial L^c/\partial g_j^D$ described above. In the case that unit $j$ is an output unit, the derivative is fixed, namely

$$\frac{\partial^2 L^c}{\partial (g_j^D)^2} = \frac{\partial^2 L^c}{\partial (f_j^c)^2} = e^{\gamma_\delta}$$

This is propagated backwards to obtain the second derivatives of $L^c$ with respect to all the inputs $g_i$, except for the input units, whose derivatives are not needed. Neal propagates the derivatives using

$$\frac{\partial^2 L^c}{\partial (g_i^S)^2} \approx \sum_j \frac{\partial^2 L^c}{\partial (g_j^D)^2}\, \tilde V_{ij}^2\, e^{-\gamma}$$

Because we are not allowed to use the current values of the parameters, we replace $\tilde V_{ij}^2$ in the above by the estimate $1$, since $\tilde V_{ij}$ has variance $1$ at equilibrium. Thus we actually use

$$\frac{\partial^2 L^c}{\partial (g_i^S)^2} \approx \sum_j \frac{\partial^2 L^c}{\partial (g_j^D)^2}\, e^{-\gamma}$$

From the two equations above, we see that the second derivatives are the same for all training cases. Thus the backpropagation pass is done only once, regardless of the number of training cases, after which the second derivatives with respect to the parameters can be estimated in time of order equal to the number of parameters.
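A sketch of the resulting approximation for a single hidden layer feeding one output unit: the second derivative is fixed at $e^{\gamma_\delta}$ at the output, and each backpropagation step multiplies by the estimate $1$ for $\tilde V_{ij}^2$ and by $e^{-\gamma}$ (illustrative sketch; as noted above, the result is independent of the training case):

```python
import numpy as np

def approx_second_derivs(n_hidden, gamma, gamma_d):
    """Approximate d2L/dg^2 at the hidden layer: each hidden unit receives
    e^{gamma_d} from the single output unit, with V~^2 replaced by its
    equilibrium estimate 1 and a factor e^{-gamma} from the
    reparameterized hidden-to-output weight."""
    d2L_df2 = np.exp(gamma_d)                      # fixed at the output unit
    d2L_dg2 = np.full(n_hidden, d2L_df2 * np.exp(-gamma))
    return d2L_dg2

# With one output unit, every hidden unit gets the same estimate.
d2 = approx_second_derivs(4, gamma=0.5, gamma_d=0.2)
assert d2.shape == (4,) and np.allclose(d2, np.exp(0.2) * np.exp(-0.5))
```

Because nothing here depends on a training case, this pass is done once per leapfrog update rather than once per case.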

Second Derivatives with Respect to the Hyperparameters

To obtain $\sum_{c=1}^{N_c} L^c$, we need to calculate the network output for each training case. Like the computation for $\partial \tilde E/\partial \gamma_\delta$, this can be done with a forward pass for each case. Unlike that computation, we cannot use the current values of the hyperparameters. In this thesis we replace $\Gamma$ by the prior hyperparameter means $\bar\Gamma$:

$$\bar\gamma_\delta = \overline{\log \tau_\delta}, \qquad \bar\gamma_u = \overline{\log \tau_u}, \qquad \bar\gamma_v = \overline{\log \tau_v}, \qquad \bar\gamma_a = \overline{\log \tau_a}, \qquad \bar\gamma_b = \overline{\log \tau_b}$$

Thus we really compute

$$\frac{\partial^2 \tilde E}{\partial \gamma_\delta^2} \approx \sum_{c=1}^{N_c} \bar L^c$$

where $\bar L^c$ denotes $L^c$ evaluated with the hyperparameters set to $\bar\Gamma$.

For the second derivatives with respect to the other hyperparameters, we similarly replace all occurrences of $\Gamma$ by $\bar\Gamma$. We see that we need to estimate $\partial^2 L^c/\partial \gamma^2$. For this we will use the same kind of backpropagation as when estimating the second derivatives with respect to the parameters.

Once again referring to the two-layer figure, for $\gamma$ being either a hyperparameter for the weights or for the biases that contribute to the calculation of the $g_j^D$,

$$\frac{\partial^2 L^c}{\partial \gamma^2} = \frac{\partial}{\partial \gamma} \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \gamma} = \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial^2 g_j^D}{\partial \gamma^2} + \sum_j \sum_{j'} \frac{\partial^2 L^c}{\partial g_j^D\, \partial g_{j'}^D}\, \frac{\partial g_j^D}{\partial \gamma}\, \frac{\partial g_{j'}^D}{\partial \gamma}$$

where we can make the replacement $\partial^2 g_j^D/\partial \gamma^2 = -\frac12\, \partial g_j^D/\partial \gamma$, which can easily be checked. For the second term, following Neal (Appendix A), we ignore multiple connections from the same unit, which amounts to dropping all cross terms ($j \neq j'$). We end up with

$$\frac{\partial^2 L^c}{\partial \gamma^2} \approx -\frac12 \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \gamma} + \sum_j \frac{\partial^2 L^c}{\partial (g_j^D)^2} \left(\frac{\partial g_j^D}{\partial \gamma}\right)^2$$

 c D 

Wehave already seen howwe can estimate L g using the approximate back

j

propagation Eqn of Section The dierence here is that wecanusethe

actual values of the parameters but not the hyp erparameters Thus we replace by its

estimate in the backpropagation equation

 c  c

X

L L



V e

ij

S D

 

g g

i j

j

As before, we compute the above quantity in a backward pass, only down to the first hidden layer, and these derivatives are all independent of the training case.

For the second factor in a summation term, evaluation is straightforward. From the expression for $g_j^D$ we have

$$\frac{\partial g_j^D}{\partial \gamma'} = -\frac12\, \tilde b_j\, e^{-\bar\gamma'/2}$$

for a hyperparameter controlling biases, and

$$\frac{\partial g_j^D}{\partial \gamma} = -\frac12\, e^{-\bar\gamma/2} \sum_i \tilde V_{ij}\, h_i^S = -\frac12\, \bar\zeta_j$$

for a hyperparameter controlling weights. Once again, if we save $\bar\zeta_j$ during the forward pass used to obtain the network outputs for estimating $\partial^2 \tilde E/\partial \gamma_\delta^2$, the above factors can be

obtained almost for free. We emphasize here that this $\bar\zeta_j$ differs from the $\zeta_j$ used in the first derivative calculations, in that the forward pass during which $\bar\zeta_j$ is saved uses the estimates $\bar\Gamma$ instead of $\Gamma$.

Thus far, the dominating computation in estimating the second derivative of $\tilde E$ with respect to the hyperparameters is the estimation of $\partial^2 \tilde E/\partial \gamma_\delta^2$, which requires a forward pass for every training case.

Let us now turn our attention to the first derivative $\partial L^c/\partial g_j^D$ appearing in the first term above. Its computation was already discussed earlier, except that here it uses the estimates $\bar\Gamma$ instead of $\Gamma$, and the error propagated backwards comes from the forward pass used in estimating $\partial^2 \tilde E/\partial \gamma_\delta^2$.

The estimation of this derivative requires one backward pass, which must be done for each training case. Thus, combined with the forward passes for $\partial^2 \tilde E/\partial \gamma_\delta^2$, two passes through the net are necessary to estimate the second derivatives of $\tilde E$ with respect to the hyperparameters.

Summary of Compute Times

In this section we summarize the compute time required by the Dynamical B method per leapfrog update of both the parameters and hyperparameters. Recall that the Dynamical B method includes the reparameterization of the weights.

First, we summarize the compute times required to calculate each group of derivatives in the first table below. The compute times are dominated by passes through the network for each training case, which take time $O(N_c |\theta|)$, and we consider other operations as essentially free.

Examining a leapfrog update in detail, we see that it looks like the second table below.

When the parameters are changed (step 2), the first derivatives of both parameters and hyperparameters need to be recalculated in order to update the momenta in steps 3 and 4.

Group of Derivatives                                    Compute time
1st, with respect to parameters and hyperparameters     O(N_c |theta|)
2nd, with respect to parameters                         free
2nd, with respect to hyperparameters                    O(N_c |theta|)

Table: Cost of computing various groups of derivatives for the Dynamical B method.

Step    Description
1       Update momentum of parameters
2       Update parameters
3       Update momentum of parameters
4       Update momentum of hyperparameters
5       Update hyperparameters
6       Update momentum of hyperparameters

Step 1 is then repeated for the next leapfrog update.

Table: The steps in one complete leapfrog update in the Dynamical B method. The Dynamical A method has the same sequence of steps comprising one leapfrog update.

The cost is one forward-backward pass pair. Also, at the end of step 2, the second derivatives with respect to the hyperparameters need to be recalculated for use in steps 4 through 6, taking a second forward-backward pass pair. After the update of the hyperparameters at step 5, the first derivatives need to be recalculated again for the momentum updates at step 6 and step 1 of the next complete leapfrog update. This requires a third forward-backward pass pair. Finally, the second derivatives with respect to the parameters also need to be recalculated for use in the next iteration, but this is essentially free.

Thus each leapfrog update costs three forward-backward pass pairs. This result will be used later in determining how long to let the reparameterized new method run when comparing its performance to the old method.
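The six steps can be sketched generically as follows, assuming the momentum updates are the usual leapfrog half-steps; the gradient functions and stepsizes here are stand-ins for the derivative computations and stepsize heuristics described in this chapter:

```python
def leapfrog_update(theta, p_theta, gamma, p_gamma,
                    grad_theta, grad_gamma, eta_p, eta_h):
    """One complete leapfrog update in the style of the Dynamical methods:
    half-step / full-step / half-step for the parameters, then the same
    pattern for the hyperparameters.  grad_theta(theta, gamma) and
    grad_gamma(theta, gamma) return first derivatives of the potential
    energy (stand-ins here); eta_p and eta_h are the stepsizes."""
    p_theta = p_theta - 0.5 * eta_p * grad_theta(theta, gamma)   # step 1
    theta = theta + eta_p * p_theta                              # step 2
    p_theta = p_theta - 0.5 * eta_p * grad_theta(theta, gamma)   # step 3
    p_gamma = p_gamma - 0.5 * eta_h * grad_gamma(theta, gamma)   # step 4
    gamma = gamma + eta_h * p_gamma                              # step 5
    p_gamma = p_gamma - 0.5 * eta_h * grad_gamma(theta, gamma)   # step 6
    return theta, p_theta, gamma, p_gamma
```

On a simple quadratic potential, a single update of this form approximately conserves the Hamiltonian, which is the property the discussion above relies on.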

Computation of Stepsizes

We have described our heuristic for approximating the second derivative of the potential energy with respect to the hyperparameters. The stepsize is then computed as the inverse square root of that second derivative. But because this heuristic uses the approximations above, second derivatives have the possibility of being negative, and square roots would then be imaginary. We note that when the second derivative becomes negative, it merely indicates that the potential energy surface is now concave downwards, but its magnitude should still be indicative of the length scale of the surface variations in the region. Thus the negative sign is really no problem, and we take absolute values to obtain the stepsize as

$$\epsilon_\gamma = \left| \frac{\partial^2 \tilde E}{\partial \gamma^2} \right|^{-1/2}$$

Similarly, the stepsizes for the parameters are obtained as

$$\epsilon_{\tilde u_i} = \left| \frac{\partial^2 \tilde E}{\partial \tilde u_i^2} \right|^{-1/2}$$

and similarly for the other parameter types.
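A one-line sketch of this stepsize heuristic, with the absolute value guarding against negative curvature (the adjustment factor is the tuning parameter discussed later):

```python
def stepsize(second_deriv, adjust=1.0):
    """Stepsize heuristic: epsilon = adjust * |d2E|^{-1/2}.  The absolute
    value handles negative curvature (concave-down regions), where the
    magnitude still sets the local length scale."""
    return adjust * abs(second_deriv) ** -0.5

assert stepsize(4.0) == 0.5
assert stepsize(-4.0) == 0.5   # negative curvature handled by abs()
```

The same formula is applied to every parameter and hyperparameter coordinate, each with its own approximate second derivative.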

Compute Times for the Dynamical A Method

We also give the compute time for one leapfrog update for the Dynamical A method, as this has to be taken into account later in the performance comparison. Recall that the Dynamical A method is the new method before the weight reparameterization.

The computation of the first and second derivatives with respect to the weights in this scheme is not significantly different from Dynamical B's. The algorithmically demanding portions of these computations, the derivatives of $L^c$ with respect to a weight and the backpropagation given above, work the same way, except that the factors of $e^{-\gamma/2}$ that always go with the reparameterized weights are missing. Thus first

derivatives with respect to the weights also require one forward and one backward pass for each training case, while second derivatives are essentially free.

The computation of the first and second derivatives with respect to the hyperparameters differs substantially, however, because the hyperparameters do not appear in the computation of the network output $f$ in this case. The first derivatives are

$$\frac{\partial E}{\partial \gamma_\delta} = -\frac{N_c}{2} + \frac{e^{\gamma_\delta}}{2} \sum_{c=1}^{N_c} |f(x^c) - y^c|^2 + \cdots$$

and

$$\frac{\partial E}{\partial \gamma_u} = -\frac{N_u}{2} + \frac{e^{\gamma_u}}{2} \sum_{i=1}^{N_u} u_i^2 + \cdots$$

(where the terms not written out come from the priors on the hyperparameters), and similarly for the hyperparameters of $v$, $a$ and $b$.

The first derivatives with respect to $\gamma_\alpha$, where $\alpha = u$, $v$, $a$ and $b$, take time of order the number of parameters, independent of the number of training cases. This makes them of lower order time complexity than the forward and backward passes for computing the first derivatives of the parameters. The computation of $\partial E/\partial \gamma_\delta$ requires all the network outputs for each training case, but these were already computed during the forward passes for the first derivatives with respect to the parameters, so we essentially get this derivative for free as well.

The second derivatives with respect to the hyperparameters are

$$\frac{\partial^2 E}{\partial \gamma_\delta^2} = \frac{e^{\bar\gamma_\delta}}{2} \sum_{c=1}^{N_c} |f(x^c) - y^c|^2 + \cdots$$

and

$$\frac{\partial^2 E}{\partial \gamma_u^2} = \frac{e^{\bar\gamma_u}}{2} \sum_{i=1}^{N_u} u_i^2 + \cdots$$

and similarly for the hyperparameters of $v$, $a$ and $b$; as before, the current hyperparameter values cannot be used, so the prior means $\bar\gamma$ appear.

The compute time for each second derivative is essentially the same as that for the first derivative with respect to the same hyperparameter. Once again, we can use the

network outputs already computed for the first derivatives with respect to the parameters. Note that this differs from the case of the reparameterized weights, because there the hyperparameters are involved in computing the network outputs, and so the outputs computed during the forward passes for the first derivatives cannot be used, as the second derivatives with respect to the hyperparameters cannot use the current values of the hyperparameters. That forced us to redo the passes through the network with estimates for the hyperparameters, but we do not have to do that here, thereby saving computation.

We summarize the various compute costs in the table below.

Group of Derivatives                                    Compute time
1st, with respect to parameters and hyperparameters     O(N_c |theta|)
2nd, with respect to parameters                         free
2nd, with respect to hyperparameters                    free

Table: Cost of computing various groups of derivatives for the Dynamical A method.

Like before, one leapfrog update requires the computation of all the first derivatives twice. Therefore, one leapfrog update requires two forward-backward pass pairs in this case.

Chapter

Results

Training Data

To verify the new methods and compare their performance with the old method, the synthetic data set shown in the table below was used. We use a small data set to reduce the compute time required to obtain the results for this thesis. Even then, months passed before all the necessary runs were completed.

[Table: The training data, which has two inputs (x, y) and one output (z); eight cases are listed. These data are plotted in the figure below.]

[Figure: The points comprising the synthetic data set used. There are 2 inputs, x and y, and target z. The training data are shown using asterisks. The circles represent the training data before the addition of Gaussian noise, while the crosses are the input data drawn on a plane of constant z.]

[Figure: The surface from which the synthetic data was taken.]

The training data was synthesized as follows. The inputs $(x_i, y_i)$ were drawn uniformly. The mapping used on each input pair was calculated using a fixed function $f(x, y)$ built from polynomial terms in $x$ and $y$. This mapping is illustrated in the figure above. Gaussian noise was then added to each function output to obtain the target.

Verification of the New Methods

To verify the correctness of the new methods, the posterior distribution was obtained using all the methods for a network of hidden units on the above training data. The old Gibbs sampling program, as implemented by Neal, was treated as the standard against which the new programs were compared.

The priors for the hyperparameters are specified as gamma distributions, using their two parameters. For the demonstration of the correctness of the new methods, they were set as in the table below.

[Table: Settings for the parameters specifying the priors of the hyperparameters (one pair of gamma-prior parameters for each of the u, v, a and b groups).]

Fairly long runs were done with all three methods at the settings in the table below.

[Table: Settings for the various methods (number of leapfrog steps l, stepsize adjustment factor, and how often samples were saved), used to verify the correctness of the new programs. For the new methods, the parameter and hyperparameter stepsize adjustment factors were set equal to each other, and their common value is indicated by a single column in this table.]

Results From the Old Method

The output surface as predicted from samples obtained using the Gibbs update method is shown in the figure below. We see that the data points are being fitted reasonably. The fact that the points are being fitted implies that the posterior distribution has changed from the prior. Indeed, from the marginal plots below, we see that the marginal posterior distributions of the hyperparameters differ from the marginal prior distributions.

We also show the correlation between the input-to-hidden and the hidden bias hyperparameters. As expected, when larger input-to-hidden weights are allowed, larger biases with the opposite sign are required to compensate. This is because the output function cannot be composed of hidden units that all saturate, so the input into at least some of the hidden units must be kept small.

Also, we show how, as the number of hidden units increases, the hidden bias hyperparameter becomes more pinned to the actual standard deviation of the hidden biases, and vice versa. This is manifested as an increased correlation between them. This is a direct demonstration of the problem we set out to solve.

Results of New Methods Compared with the Old

To verify the correctness of the programs implementing the new methods, we compare the posterior distributions obtained using the new programs against that from the old.

[Figure: The surface predicted by the samples obtained from the master run. The training data are shown using asterisks, while the crosses are the input data drawn on a plane of constant z.]

To obtain the posterior distribution, any samples near the beginning found by visual inspection not to be in equilibrium were first dropped; the samples from each method were then used to plot the histogram of each hyperparameter. Here we look at the histogram of $\log \sigma$, which is the log of each hyperparameter specified as a standard deviation. Each such histogram approximates the marginal distribution of a hyperparameter.

We need to be able to compare the joint distribution over hyperparameters for two methods. Because we are unable to plot a distribution over the joint five-dimensional space of all the hyperparameters, we compare marginal distributions instead. This comparison is valid because, if the marginal distributions of the hyperparameters match for two methods, then they almost certainly have the same joint distribution over hyperparameters. While it is true that, in principle, equal marginal distributions do not imply equal joint distributions, the fact that the marginals match would be too amazing a coincidence to be explained any other way than by concluding that the joint distributions match.

[Figure: Marginal prior and posterior densities of the hyperparameters $\tau_\delta$, $\tau_u$, $\tau_v$, $\tau_a$ and $\tau_b$. Dotted lines represent prior densities obtained analytically, while solid lines represent posterior marginal distributions obtained from the Gibbs sampling method. These plots show that the posterior distributions of the hyperparameters have changed from the priors.]

[Figure: Correlation between the input-to-hidden ($\sigma_u$) and hidden bias ($\sigma_a$) hyperparameters, obtained using the old method.]

[Figure: The correlation between the standard deviation of the hidden biases, $\log \mathrm{SD}(a_i)$, and their hyperparameter, $\log \sigma_a$, becomes stronger as the hidden layer size increases. Panel (a) shows the smaller number of hidden units; panel (b) the larger.]

As can be seen in the comparison figures below, the marginals for all the hyperparameters obtained by the programs running both the new methods match those of the original as implemented by Neal.

Methodology for Evaluating Performance

Having shown that the new methods have been correctly implemented, we are now ready to assess their performance.

A Markov chain Monte Carlo method typically goes through a burn-in phase before settling down to equilibrium. Before reaching equilibrium, its samples are not representative of its invariant distribution. As these non-representative samples should be discarded in order not to skew later estimates, the speed with which a Markov chain equilibrates is a matter of interest. However, due to time constraints, we will not be considering this question. Instead, we will only assess the relative performances of the old and the new methods in moving about in the posterior distributions of the hyperparameters once equilibrium has been reached. We do not consider the speed with which the posterior of the parameters is explored, as there can be large numbers of parameters, and it is difficult to know which parameters to compare, as they can sometimes take on different roles.

In addition to examining the performances of the original method and the Dynamical B method, we also look at the performance of the Dynamical A method, to ascertain whether the reparameterization of the weights is indeed beneficial.

The performance of the methods depends on the number of leapfrog steps allowed in one trajectory and the stepsize adjustment factor. These can be viewed as tuning parameters that affect the efficiency of each method. Since performance can vary dramatically depending on the setting of these tuning parameters, it is only fair to compare how well the methods work when optimally tuned.

For the old method, the tuning parameters are l, the number of leapfrog steps allowed

[Figure: Marginal posterior histograms of $\log \sigma_\delta$, $\log \sigma_u$, $\log \sigma_v$, $\log \sigma_a$ and $\log \sigma_b$. Crosses represent the distribution obtained from the original Gibbs update method, while circles represent that of the Dynamical A update method. Each distribution has been normalized to have unit area.]

[Figure: Marginal posterior histograms of $\log \sigma_\delta$, $\log \sigma_u$, $\log \sigma_v$, $\log \sigma_a$ and $\log \sigma_b$. Crosses represent the distribution obtained from the original Gibbs update method, while circles represent that of the Dynamical B update method. Each distribution has been normalized to have unit area. Note that the distribution for the original method looks slightly different from that of the plots for the previous comparison with the Dynamical A method, because the binning for the histograms is slightly different.]

in one trajectory, and the stepsize adjustment factor. For the new methods, l is also a tuning parameter, but we now have two stepsize adjustment factors: $\eta_p$ for the parameters and $\eta_h$ for the hyperparameters. We have two stepsize adjustment factors because we might wish to control how fast the parameters move compared to the hyperparameters, in order to obtain the best performance. Moreover, the different heuristics with which the stepsizes are computed for the parameters and the hyperparameters mean that their relative magnitudes might be quite different, which also suggests separate stepsize adjustment factors. However, due to time constraints, we will not explore the problem of how to set the two adjustment factors separately, and will instead set them equal to each other, $\eta_p = \eta_h$, and call the common value $\eta$, with the understanding that the performance of the new methods could be increased if the two factors were not set equal to each other.

The performance of a method on a hyperparameter is assessed using the variance of means measure, whose presentation we delay till the next section. For a given setting of the tuning parameters, this measure can be computed for each hyperparameter. The smaller the measure is, the more efficiently the method explores the marginal posterior distribution of that hyperparameter. Since we wish to compare the methods when they are operating optimally, we will compare their variance of means measures at optimal settings of l and $\eta$.

To find the optimal setting of l and $\eta$ for a given method, we run it over a grid of settings in tuning parameter space, measuring its variance of means performance at each setting. The geometric mean of the variance of means of the hyperparameters is computed to obtain a single measure that combines the performances over the different hyperparameters. In computing the geometric mean, we leave out the variance of means measure for the output bias hyperparameter, as that hyperparameter controls only one parameter for this network and is therefore not that meaningful. The optimal setting of the tuning parameters is then picked as the setting that minimizes the geometric mean.

It is not safe to directly use the variance of means at this optimal setting to compare

Chapter Results

the performances of the methods, however, as the measures are biased downwards as a result of this selection process. Instead, the programs are then rerun with different random seeds to re-obtain the variance of means measures, so as to avoid the bias. In the next section we describe the variance of means measure in greater detail.

The Variance of Means Measurement of Performance

The variance of means measure characterizes how well a method explores the posterior distribution of a hyperparameter at a fixed setting of the tuning parameters. To obtain this measure, we run several Markov chains, each started from an independent point drawn from the posterior distribution of the parameters and hyperparameters. Each chain is run for a fixed number N_f of leapfrog steps, regardless of what l is.

For a given setting of the tuning parameters, we obtain the means of the hyperparameters sampled by each chain. The entire chain of N_f leapfrog steps can be thought of as being divided up into a fixed number N_s of supertransitions, each comprised of N_f / N_s leapfrog steps. Although each leapfrog trajectory yields a sample from the correct distribution, we use only one sample per supertransition to compute the means of the hyperparameters for each chain. Thus, regardless of the value of l, the same number of samples N_s is used to compute the means, as shown in the table below. For the i-th chain, the hyperparameter means obtained are

    \overline{\log \sigma}^{(i)}, \quad \overline{\log \sigma_u}^{(i)}, \quad \overline{\log \sigma_v}^{(i)}, \quad \overline{\log \sigma_a}^{(i)}, \quad \overline{\log \sigma_b}^{(i)}

where, for instance, each mean \overline{\log \sigma_u}^{(i)} is computed as follows:

    \overline{\log \sigma_u}^{(i)} \;=\; \frac{1}{N_s} \sum_{j=1}^{N_s} \log \sigma_u^{(ij)}

Here σ refers to a hyperparameter expressed as a standard deviation. We take the log before computing the mean, as experience shows that the standard deviation can vary over several orders of magnitude, and yet variations on a small scale are as interesting as variations on a scale a few orders of magnitude larger.
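As a concrete sketch of this averaging, the per-chain means can be computed as follows (a minimal NumPy illustration; the array layout and the function name are ours, not the thesis's):

```python
import numpy as np

def chain_log_means(sigma_samples):
    """Per-chain means of log(sigma).

    sigma_samples has shape (n_chains, N_s): one hyperparameter value
    (expressed as a standard deviation) per supertransition per chain.
    Averaging in the log domain keeps small-scale and large-scale
    variation on an equal footing."""
    return np.log(np.asarray(sigma_samples)).mean(axis=1)

samples = [[1.0, 1.0, 1.0, 1.0],        # chain 0: sigma constant at 1
           [np.e, np.e, np.e, np.e]]    # chain 1: sigma constant at e
print(chain_log_means(samples))         # -> [0. 1.]
```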

[Table: for each chain i = 1, …, N_m, the supertransition samples \sigma^{(ij)}, \sigma_u^{(ij)}, \sigma_v^{(ij)}, \sigma_a^{(ij)}, \sigma_b^{(ij)}, j = 1, …, N_s, and the resulting per-chain means \overline{\log \sigma}^{(i)}, \overline{\log \sigma_u}^{(i)}, \overline{\log \sigma_v}^{(i)}, \overline{\log \sigma_a}^{(i)}, \overline{\log \sigma_b}^{(i)}.] Each chain is run for N_s supertransitions at some setting of l and η; the samples obtained from the supertransitions are used to compute the means of each hyperparameter.

If we run N_m Markov chains, we will have N_m values of \overline{\log \sigma}. The variance of means measure that we have been talking about is then just the variance of these values of \overline{\log \sigma}. We define the quantities \hat{\sigma}^2 to be these variance of means measures:

    \hat{\sigma}^2 = \mathrm{Var}[\overline{\log \sigma}], \quad
    \hat{\sigma}_u^2 = \mathrm{Var}[\overline{\log \sigma_u}], \quad
    \hat{\sigma}_v^2 = \mathrm{Var}[\overline{\log \sigma_v}], \quad
    \hat{\sigma}_a^2 = \mathrm{Var}[\overline{\log \sigma_a}], \quad
    \hat{\sigma}_b^2 = \mathrm{Var}[\overline{\log \sigma_b}]

Each variance is calculated using the mean estimated from a very long run of the old method, which we assume to be very close to the true mean. For example, if the mean for \log \sigma_u obtained from a very long run of the old method is \overline{\log \sigma_u}^{\,*}, then we compute the variance from the N_m Markov chains as

    \hat{\sigma}_u^2 \;=\; \frac{1}{N_m} \sum_{i=1}^{N_m} \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2

This is the variance of means measure of performance, with lower variance indicating better performance.
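A minimal sketch of this estimate (NumPy; the names are ours). Note that the deviations are taken about the externally supplied reference mean, not about the sample mean of the chains:

```python
import numpy as np

def variance_of_means(chain_log_means, ref_log_mean):
    """Variance-of-means measure: mean squared deviation of the
    per-chain means of log(sigma) about the reference mean estimated
    from a very long run of the old method (NOT about their own
    sample mean)."""
    d = np.asarray(chain_log_means) - ref_log_mean
    return float(np.mean(d ** 2))

print(variance_of_means([0.9, 1.1, 0.8, 1.2], 1.0))   # ≈ 0.025
```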

Let T_u be the inefficiency factor, in units of supertransitions, for the N_s samples of \log \sigma_u. T_u measures the worth of each supertransition sample of \log \sigma_u relative to one independent sample of \log \sigma_u. For example, if T_u is 2, then it takes twice as many supertransitions to obtain a given variance of \overline{\log \sigma_u} as would be needed using independent samples. Mathematically, the variance of means measures are related to inefficiency factors as follows:

    \hat{\sigma}_u^2 \;\approx\; \mathrm{Var}[\overline{\log \sigma_u}] \;=\; \frac{T_u \, \mathrm{Var}[\log \sigma_u]}{N_s}

Since Var[\log \sigma_u] is a constant property of the posterior, and we are using the same N_s for all tuning parameter settings, our estimate of Var[\overline{\log \sigma_u}] is proportional to T_u. Although this T_u is for this particular number of samples N_s only, we would expect that if a method has a lower T_u than another for this N_s, then it really is more efficient at exploring the posterior distribution, and so it should remain better for a different N_s. Thus this measure is indicative of performance in general.
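Rearranged, this relation lets the inefficiency factor be read off a variance-of-means estimate; a small illustrative helper (ours, not the thesis's code):

```python
def inefficiency_factor(var_of_means, posterior_var, n_super):
    """Invert  var_of_means = T * posterior_var / N_s  for T.

    T = 1 means supertransition samples behave like independent
    draws; T = 2 means twice as many supertransitions are needed to
    reach a given variance of the mean."""
    return n_super * var_of_means / posterior_var

print(inefficiency_factor(var_of_means=0.02, posterior_var=1.0, n_super=100))  # -> 2.0
```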

Error Estimation for Variance of Means

Error bars on the variance of the means \hat{\sigma}_u^2 are obtained as follows. Since \hat{\sigma}_u^2 is calculated as the mean of the squared deviations \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2, its variance is given by

    \mathrm{Var}[\hat{\sigma}_u^2] \;\approx\; \frac{1}{N_m} \, \mathrm{Var}\!\left[ \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2 \right]

which is valid so long as the chains are independent. We ensure this by picking their starting points sufficiently far apart from a long master run, and by using a different random number seed for each chain.

Geometric Mean of Variance of Means

Rather than characterizing performance by the variances of the hyperparameter means for all the hyperparameters separately, we define the following single geometric-mean scalar measure of performance:

    g \;=\; \left( \hat{\sigma}^2 \, \hat{\sigma}_u^2 \, \hat{\sigma}_v^2 \, \hat{\sigma}_a^2 \right)^{1/4}

We obtain error bars for g by bootstrapping (see Efron and Tibshirani) as follows. Let Z be the original set of N_m Markov chains; thus Z determines a single value of g. A bootstrap realization Z* is a set of N_m Markov chains sampled uniformly with replacement from the original chains. During bootstrapping, many bootstrap realizations Z* are generated from Z. The idea is that the empirical distribution represented by Z contains within it the natural variations that g has over the true distribution from which Z is drawn. So a value of g can be calculated from each Z*, and the histogram of these resulting g's gives an estimate for the actual distribution of g. We quote error bars as a central confidence interval of this histogram, i.e. we quote the interval (g_lo, g_hi), where g_lo and g_hi are such that equal fractions of the values of g obtained from the Z* fall below g_lo and above g_hi. We use many bootstrap samples to obtain these error bars.
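A sketch of this percentile bootstrap for g (NumPy; the toy data, the 5%/95% tails, and the 1000 realizations are illustrative assumptions, not the thesis's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_mean_g(sq_dev):
    """g for one set of chains: the geometric mean, over
    hyperparameters, of the variance-of-means estimates (column means
    of the squared deviations)."""
    return float(np.exp(np.mean(np.log(sq_dev.mean(axis=0)))))

def bootstrap_g(sq_dev, n_boot=1000, tail=5.0):
    """Percentile-bootstrap interval for g: resample whole chains
    (rows) with replacement and recompute g each time."""
    n = len(sq_dev)
    gs = [geometric_mean_g(sq_dev[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return np.percentile(gs, [tail, 100.0 - tail])

# Toy data: squared deviations (log chain mean minus log reference)^2,
# 30 chains x 4 hyperparameters (output-bias hyperparameter excluded).
sq_dev = rng.gamma(2.0, 0.05, size=(30, 4))
lo, hi = bootstrap_g(sq_dev)
print(lo, hi)
```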

Iterations Allowed for Each Method

As shown above, \hat{\sigma}^2 is proportional to the inefficiency factor of the method in units of supertransitions. In order for comparisons of variances between two methods to make sense, the amount of compute time used per supertransition should be the same in both cases. However, the compute time of a program is tricky to calculate from things such as the number of multiplications, as programming style can affect it. Thus, in this thesis, the amount of work that goes into a supertransition is estimated in algorithmic terms, in which we assume that the number of training cases N_c is large and the number of hidden units N_h is also large, so that the number of weights |θ| is large. This results in terms of order N_c |θ| dominating the time complexity, which comes entirely from complete sweeps through the neural network for each training case.

In the old method, each trajectory is composed of the steps of sampling the hyperparameters, computing the stepsizes, and performing the leapfrog steps. First, we note that each leapfrog step requires the evaluation of the derivative of the potential energy with respect to each parameter, for instance

    \frac{\partial E}{\partial u_i} \;=\; \frac{1}{\sigma^2} \sum_{c=1}^{N_c} \left( f(x^c) - y^c \right) \frac{\partial f(x^c)}{\partial u_i}

which is most efficiently computed for each training case using a forward sweep through the network for f(x^c) - y^c, and a backward sweep for \partial f(x^c) / \partial u_i. Each of the sweeps takes O(N_c |θ|) compute time for all the training cases. By comparison, the sampling of the hyperparameters is dominated by the computation of the sum of the squares of the weights, and by the computation of the total squared error of the network output. The former takes O(|θ|), while the latter costs only O(N_c), since the network error has already been computed at the end of the last leapfrog step for the Metropolis rejection test. So both can be neglected when compared to the compute time required for a leapfrog step. The computation of the stepsizes is also free by comparison, because its summation over training cases (Neal, Appendix A) can be factored out and computed at the very beginning of the program. Thus each supertransition in the old method is approximately dominated by the leapfrog steps, each of which takes a pair of forward/backward sweeps, each of which takes O(N_c |θ|) compute time. We summarize this, along with the compute times for the new methods, in the table below.
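This bookkeeping can be phrased as a toy cost model (ours, not the thesis's); the per-leapfrog-step pass counts of 2, 4, and 6 are an assumption read off the stated ratios (Gibbs does one forward/backward pair, and the dynamical methods cost two and three times that per step), with only the dominant O(N_c |θ|) sweeps counted:

```python
def passes_per_supertransition(method, leapfrog_steps):
    """Network passes per supertransition, counting only the dominant
    O(N_c * |theta|) forward/backward sweeps.  The per-step counts
    (2, 4, 6) are an assumption implied by the stated compute ratios,
    not values quoted from the thesis's table."""
    per_step = {"gibbs": 2, "dynamical_a": 4, "dynamical_b": 6}
    return per_step[method] * leapfrog_steps

# Equalize compute: give every method the same pass budget.
budget = passes_per_supertransition("gibbs", 600)   # 1200 passes
print(budget // 4, budget // 6)                     # steps allowed for A and B -> 300 200
```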

Hyperparameter updates by    Network passes per leapfrog step
Gibbs                        2
Dynamical A                  4
Dynamical B                  6

Table: network passes required per leapfrog step under each method. For a detailed explanation of the number of passes required for the two dynamical methods, please refer to the corresponding earlier section.

Because of these results, the Dynamical B method is only allowed a third as many leapfrog steps per supertransition as the old method, while the Dynamical A method is allowed half. These measures ensure that each supertransition uses approximately the same amount of compute time regardless of method.

Markov chain start states

Master Runs

As mentioned earlier, each chain is started from a state chosen from equilibrium. To obtain the starting points for a particular network, a very long run was done using the old method, typically resulting in several hundred thousand to a million or more samples. Any initial portion of the run not in equilibrium was discarded, and starting points were then obtained from the remaining samples. These starting states were spaced many inefficiency factors apart, so that they are likely to be nearly independent points that represent the posterior distribution well.

The long run was done with a relatively small stepsize adjustment factor, so that the rejection rate is low. This is because chains using large stepsize adjustment factors may sometimes be unable to enter certain regions of state space where the rejection rate is high: once such a chain enters such a region, it is likely to stay there for a long time due to the high rejection rate, and because it would remain stuck there for a long time, it follows that it is unlikely to enter that region in the first place. Thus, using a low rejection rate reduces the risk of overlooking such regions.

Long runs were obtained for networks at each of the hidden layer sizes tested. The table below shows the prior settings used for these networks, while a further table summarizes the run settings used, the number of samples obtained, and the resulting rejection rates.

[Table: prior settings for the hyperparameters σ, σ_u, σ_v, σ_a, and σ_b, for each of the neural net architectures tested.]

In order to ensure that our start states are from the equilibrium distribution, the hyperparameters from the master runs were visually checked for the absence of long-term trends.

[Table: tuning parameter settings (l and η), number of samples collected, and rejection rates for the master run at each hidden layer size.] Rejection rates were chosen to be relatively low, to reduce the risk of not being able to enter regions where the rejection rates are high. For all the runs, only one sample was saved out of every several collected.

Starting States Used

Variance of means measures were obtained for networks at each hidden layer size. This requires several Markov chain starting states for each number of hidden units. This section shows in detail how these states were obtained.

The table below shows the inefficiency factors of \log \sigma_u, as that was found to be the slowest-moving hyperparameter, i.e. it had the highest inefficiency factor. We use σ_u to denote the hyperparameter expressed as a standard deviation, which is often easier to understand. We compute the inefficiency factor of \log \sigma_u, rather than that of σ_u itself, as the inefficiency factor is then the same regardless of what power σ_u is raised to, thus removing any question as to whether it is more appropriate to measure the inefficiency factor of the variance or of the standard deviation.

As each chain used in the variance of means measure should be started from an independent point, we use samples from the master run spaced many inefficiency factors apart.

All the variance of means measures were computed using the same number of chains. The start states of the first chains are evenly spaced, as shown in the table below. It was decided to add more chains because it was empirically observed that the different methods seem to do very differently on chains starting at large \log \sigma_u and \log \sigma_a. Specifically, it appeared from the first chains that the old method and the Dynamical A method do badly on chains starting at large values of these hyperparameters, whereas the Dynamical B method does well. Because such large hyperparameter values appear rarely in the posterior distribution, there are few of them among the first chains; so, assuming that they do have an important effect on the results, it is necessary to oversample the region with large values of \log \sigma_u and \log \sigma_a, and then downweight those points accordingly in the computation of the variance of the means. Otherwise, one might by chance compute the variance of means without any chain starting at large values of those hyperparameters, and erroneously conclude that all the methods perform similarly.

[Table: for each hidden layer size, the start-state separation in samples, the corresponding separation in units of T_u, and the number of initial samples dropped.] The first starting states are drawn so that they are many inefficiency factors T_u apart; T_u is the inefficiency factor of \log \sigma_u.

To ensure that we have a decent number of points from the region G, in which \log \sigma_u or \log \sigma_a (or both) is large, it was decided to stratify the starting points, with a fixed fraction of them outside G and the rest inside. Of the first points chosen according to the separations in the table above, some are already in G. So the choice of the additional points was done as follows: for each number of hidden units, we chose enough additional points in G to bring the points in G up to the desired total, and we also chose enough points outside of G so as to obtain the desired number of them.

To obtain these new points, a long sequence of many points was obtained from the master file at fixed separations of at least several inefficiency factors. For some hidden layer sizes, the new points were obtained past the end of the portion of the master run used to obtain the first starting states; for the others, the new points used the portion already used for the first states, and beyond if available, taking care to always maintain a separation of many inefficiency factors from the first points. This sequence was then uniformly sampled to obtain the desired numbers of points in G and outside of G.

Modified Performance Measures Due to Stratification

The stratification of the starting states does change the calculation of the variance of means measure and its error somewhat. With N_G equal to the number of points taken from region G, and N_{\bar G} = N_m - N_G the number of points taken from outside it, the variances of means within each stratum for hyperparameter σ_u are

    \hat{\sigma}_{u,G}^2 \;=\; \frac{1}{N_G} \sum_{i=1}^{N_G} \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2

    \hat{\sigma}_{u,\bar{G}}^2 \;=\; \frac{1}{N_{\bar G}} \sum_{i=N_G+1}^{N_m} \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2

so that the overall variance of means that takes stratification into account is

    \hat{\sigma}_u^2 \;=\; p \, \hat{\sigma}_{u,G}^2 + (1-p) \, \hat{\sigma}_{u,\bar{G}}^2

where p is the fraction of the posterior distribution in region G.

For error estimates of the variance of the means, we compute the variance of the above as before, which is similarly computed as the weighted sum of the error variances computed separately for each stratum:

    \mathrm{Var}[\hat{\sigma}_u^2] \;=\; p^2 \, \mathrm{Var}[\hat{\sigma}_{u,G}^2] + (1-p)^2 \, \mathrm{Var}[\hat{\sigma}_{u,\bar{G}}^2]
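A minimal sketch of this recombination (NumPy; the data and names are ours):

```python
import numpy as np

def stratified_var_of_means(dev_in_G, dev_outside_G, p):
    """Overall variance-of-means when start states are stratified:
    per-stratum mean squared deviations are recombined with weights
    p and 1-p, where p is the posterior probability of region G."""
    var_G = float(np.mean(np.square(dev_in_G)))
    var_not_G = float(np.mean(np.square(dev_outside_G)))
    return p * var_G + (1.0 - p) * var_not_G

# Chains started in G deviate more, but G holds only 10% of the posterior.
v = stratified_var_of_means([0.3, -0.3], [0.1, -0.1], p=0.1)
print(v)   # ≈ 0.1 * 0.09 + 0.9 * 0.01 = 0.018
```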

[Table: estimated fraction p of the posterior distribution in region G, for each number of hidden units. Region G is defined as the region in which \log \sigma_u or \log \sigma_a, or both, exceed the chosen threshold.]

The above expressions correct for the oversampling of G using p; the table above gives estimates of p from each master run.

It should be noted that we have assumed that p is known in the calculation of the error bars of the variance of means measures, but really all we have are estimates. This simplification introduces some inaccuracies into the error calculations, but it is not likely to change our conclusions much, as we will see later.

One might question why the various values of p in the table are quite different. The author has some empirical experience showing that G is a region of somewhat higher rejection rate than normal. It is possible that some of the master runs have η's that are high enough that they enter G rarely enough to make a difference in the estimates of p. This aspect of the experiment is difficult to control, as it is usually not possible to tell in advance what the regions with high rejection rates are going to be, or whether they will make a difference to the final variance of means estimates in the end. Furthermore, if such regions are identified only after obtaining the master runs, it can be very costly computation-wise to redo the master runs at a smaller setting of η.

Finally, the bootstrap procedure takes the stratification into account as follows: the chains in Z* corresponding to the first stratum are sampled uniformly with replacement from the first-stratum chains in Z, while the chains corresponding to the second stratum are sampled uniformly with replacement from the second-stratum chains in Z. This ensures that each realization Z* is obtained from the same empirical distribution represented by Z.
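The stratified resampling can be sketched as follows (NumPy; the integer chain identities stand in for actual Markov chains, and the 20/10 split is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def stratified_bootstrap(chains_out, chains_in):
    """One bootstrap realization Z*: resample each stratum from
    itself, so Z* keeps the same number of inside-G and outside-G
    chains as the original set Z."""
    out_idx = rng.integers(0, len(chains_out), len(chains_out))
    in_idx = rng.integers(0, len(chains_in), len(chains_in))
    return [chains_out[i] for i in out_idx], [chains_in[i] for i in in_idx]

# 20 chains started outside G (ids 0-19), 10 inside G (ids 20-29).
z_out, z_in = stratified_bootstrap(list(range(20)), list(range(20, 30)))
print(len(z_out), len(z_in))          # -> 20 10
print(all(c >= 20 for c in z_in))     # strata never mix -> True
```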

Number of Leapfrog Steps Allowed

In accordance with the deemed ratios of computation involved in each supertransition for the various methods, different numbers of leapfrog iterations were used for each Markov chain. These are listed below.

[Table: the varying numbers of leapfrog steps allowed per Markov chain under Gibbs, Dynamical A, and Dynamical B hyperparameter updates, in accordance with the ratios of computation involved in each supertransition for the various methods.]

Results of Performance Evaluation

The optimal tuning parameter settings for each method and for each number of hidden units were assessed from the figures below. The optimal tuning parameters thus obtained are given in a table, and the corresponding g's, variance of means measures, and rejection rates obtained at these optimal settings, but in new runs with new random seeds, are given in a further table.

As can be seen from the table of optimal settings, the optimal setting of η for the Gibbs and Dynamical A methods tends to decrease with increasing hidden layer size; for Dynamical B, interestingly, it increases with increasing hidden layer size.

[Figure: geometric mean of the variances of means as a function of l and η for the first of the hidden layer sizes tested, under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. A circle is drawn at the optimum point, and the surface as a function of l and η is backprojected onto the walls at optimal η and l respectively. Note that the vertical scale is inverted.]

[Figure: geometric mean of the variances of means as a function of l and η for the second of the hidden layer sizes tested, under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. A circle is drawn at the optimum point, and the surface as a function of l and η is backprojected onto the walls at optimal η and l respectively. Note that the vertical scale is inverted.]

[Figure: geometric mean of the variances of means as a function of l and η for the third of the hidden layer sizes tested, under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. A circle is drawn at the optimum point, and the surface as a function of l and η is backprojected onto the walls at optimal η and l respectively. Note that the vertical scale is inverted.]

[Figure: geometric mean of the variances of means as a function of l and η for the fourth of the hidden layer sizes tested, under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. A circle is drawn at the optimum point, and the surface as a function of l and η is backprojected onto the walls at optimal η and l respectively. Note that the vertical scale is inverted.]

[Table: the optimal tuning parameters l and η, and the corresponding rejection rates, for each method (Gibbs, Dynamical A, Dynamical B) and each number of hidden units.]

The values of g are plotted for each number of hidden units in the accompanying figure. The error bars are big, and it is possible that there is essentially no difference among the three methods. However, there is some evidence that when the number of hidden units N_h is increased to the largest value tested, the Dynamical B method begins to work better than the old method.

Note that the error bars in the figure are confidence intervals of the performance measures and do not take into account inaccuracies in assessing the optimal settings of the tuning parameters. Thus the real error bars are actually bigger by some unknown amount.

From the graph, the Dynamical A method does not seem to perform that differently from the other methods, except at one number of hidden units. The cause of this seeming anomaly is discussed in the next chapter.

The graph also shows that the Dynamical B method does not show any improvement over the old method that is measurable given the size of the error bars. This is unfortunate. However, the old method does seem to show a noticeable upward trend, while the two new methods do not. This suggests that the old method is becoming more and more inefficient as the number of hidden units N_h increases, even though the compute time given to it is also increasing linearly with N_h. Thus the graph suggests that the compute time required to maintain the same level of performance, as measured by g, grows superlinearly with N_h for the old method. On the other hand, the two new methods appear more likely to be either linear or sublinear, although it is difficult to tell for certain with the limited number of data points and the noise.

The size of the error bars impedes our analysis of the results. In the next section we show how we can perform bootstrapping on pairwise comparisons to obtain clearer indications of how one method does compared to another.

Pairwise Bootstrap Comparison

To more sensitively compare how two methods perform, we can compute the ratio of the g's for two different methods, and use bootstrapping to obtain a confidence interval for that ratio. Here, a bootstrap realization is a choice of Markov chain start states, rather than of Markov chains, as the chains themselves differ for the two methods. This couples the g's for the two methods together, causing them to be evaluated at the same Markov chain start states for each bootstrap realization. In this way, we might be able to better distinguish between good and bad methods. For example, we might see that one method always has a higher g than another when started from the same point, even though their individual g's wander over a large range for different bootstrap realizations, so that the error bars on the g's overlap significantly.

In the figure below, we show these ratios, with confidence intervals obtained by bootstrapping.
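A sketch of this paired comparison (NumPy; the toy data and names are ours). Because each realization applies one shared choice of start states to both methods, a uniform factor-of-two difference between the methods is recovered exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

def g_of(sq_dev):
    # geometric mean, over hyperparameters, of the variance-of-means
    return float(np.exp(np.mean(np.log(sq_dev.mean(axis=0)))))

def paired_ratio_ci(sq_dev_a, sq_dev_b, n_boot=1000, tail=5.0):
    """Bootstrap CI for g_A / g_B in which each realization picks ONE
    set of start states (row indices) and applies it to BOTH methods,
    coupling the two g's together."""
    n = len(sq_dev_a)
    ratios = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)      # shared choice of start states
        ratios.append(g_of(sq_dev_a[idx]) / g_of(sq_dev_b[idx]))
    return np.percentile(ratios, [tail, 100.0 - tail])

a = rng.gamma(2.0, 0.05, size=(30, 4))   # per-start-state squared deviations
b = 0.5 * a                              # a method uniformly twice as good
lo, hi = paired_ratio_ci(a, b)
print(np.allclose([lo, hi], 2.0))        # coupling pins the ratio at 2 -> True
```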

[Figure: comparison of g between the three methods at the optimal setting of the tuning parameters, as the number of hidden units increases. Crosses are the old method, triangles are the Dynamical A method, and circles are the Dynamical B method. Error bars terminate with the same symbol that represents each point. These error bars represent confidence intervals, and do not take into account the error in the estimation of the optimal tuning parameters; thus the true error bars are greater than those shown.]

[Figure: comparison of the optimal performance of the three methods as the number of hidden units increases, with one panel per hyperparameter — the data noise hyperparameter, the input-to-hidden weights' hyperparameter, the hidden-to-output weights' hyperparameter, the hidden biases' hyperparameter, and the output biases' hyperparameter — each panel plotting the variance of means of the corresponding log σ against the number of hidden units. Error bars terminate with the same symbol that represents each point. These error bars are the standard deviation of each variance of means measure, and do not take into account the error in the estimation of the optimal tuning parameters; thus the true error bars are greater than those shown.]

From panel (a) of the pairwise-comparison figure, it seems fairly convincing that Dynamical B works better than the old method at the largest number of hidden units. Panel (b) suggests that Dynamical B tends also to work better than Dynamical A, while panel (c) is inconclusive on the relative performances of Dynamical A and the old Gibbs method. However, we should note that there is some uncertainty in the error bars, due to the fact that we might not really have found the true optimal settings of the tuning parameters. Also, the Markov chains we used might not have been enough to capture all the important regions. Thus, even though the pairwise comparisons might suggest that Dynamical B works better than the Gibbs method at the largest number of hidden units, it is better to be cautious and conclude that Dynamical B may work better than Gibbs, and that if it does, it is not by much.

Now that it is clear that Dynamical B is not as good as one might hope, the questions are: why is that, and can it be made to go faster? We address these questions, as well as the question of Dynamical A's large error bars in g at the anomalous number of hidden units, in the next chapter.

[Figure: pairwise comparisons of the various methods, obtained by taking ratios of g: (a) Gibbs vs Dynamical B (g_Gibbs / g_B), (b) Dynamical B vs Dynamical A (g_B / g_A), and (c) Dynamical A vs Gibbs (g_A / g_Gibbs), each plotted against the number of hidden units. Error bars represent confidence intervals obtained using bootstrap realizations. Dashed lines have been drawn at the level of 1, which signifies equal performance.]

[Table: variance of means performances \hat{\sigma}^2, \hat{\sigma}_u^2, \hat{\sigma}_v^2, \hat{\sigma}_a^2, \hat{\sigma}_b^2 and g for each method (Gibbs, Dynamical A, Dynamical B) and each number of hidden units, obtained by rerunning the various methods with new random seeds at the optimal tuning parameter settings. Each error bar is the standard deviation of its variance of means measure.]

Chapter

Discussion

Has the Reparameterization of the Network Weights Been Useful?

As we saw in the last chapter, the Dynamical A method seems to perform comparably to Dynamical B, except at one number of hidden units. Upon closer examination, we see that the large value of g for the Dynamical A method there is due to its not performing well for some Markov chain starting states with large hyperparameter values for \log \sigma_u and \log \sigma_a, i.e. the second stratum. This can be seen in the figure below, and is a manifestation of Dynamical A's inability to move efficiently between large and small values of the hyperparameters. This effect is less pronounced at the other hidden layer sizes, probably because the starting states in the second stratum are there less extreme in value. As mentioned before, this is an aspect of the experiment that is difficult to control. Nevertheless, our present results indicate that the Dynamical A method should be avoided, because it may move extremely slowly from some starting states.

[Figure: how the various methods fare on the Markov chain starting states with large hyperparameters, shown as plots of log σ_a against log σ_u under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. The circles show the starting states with large hyperparameters, the crosses show the hyperparameter means of all chains, and the dots show the means for the Markov chains that started at the large hyperparameter values. Dashed lines connect each starting state with its resulting mean.]

Making the Dynamical B Method Go Faster

As we saw in the last chapter, the Dynamical B method offers only a small improvement over the old method, if any at all. The question then is: why?

The key may lie in the rejection rates. As can be seen in the figure below, the rejection rates of the Dynamical B method differ from those of the old method and the Dynamical A method in that they show a significant increase with the leapfrog trajectory length l.

The primary motivation for the new methods is to allow the hyperparameters to be updated along with the parameters during the course of a leapfrog trajectory. For this to explore the posterior distribution efficiently, long trajectory lengths are necessary. As we can see from the figure, the Dynamical B method ended up having rejection-rate characteristics that penalize long trajectory lengths with high rejection rates; thus the optimal setting of l is not as long as we might like.

With this observation in mind, if we can find out why the Dynamical B method has this behaviour, we might be able to fix it. The next section formulates a simple model that accounts for this increase in rejection rate with l.

Explanation for the Rising Rejection Rates

If the stepsizes were infinitesimally small, the leapfrog trajectories would simulate the Hamiltonian dynamics perfectly, and the rejection rate would be zero. However, they are not, and they are furthermore calculated by heuristics that may sometimes yield inappropriate values. This leads to rejection behaviour that affects the rate at which the Markov chain explores the state space.

There are two qualitatively different ways by which a leapfrog proposal under the Dynamical B method may be rejected. In the first way, the leapfrog simulation of the Hamiltonian dynamics is stable, and H varies over a small range due to the discrete nature of the simulation. At the end of the trajectory, the distribution of H over this

nature of the simulation At the end of the tra jectory the distribution of H over this

Chapter Discussion

0 0 10 10

−1 −1 10 10

−2 −2 10 10

−3 −3 10 10

η = 0.010 η = 0.010 Rejection rate η = 0.020 Rejection rate η = 0.020 η = 0.050 η = 0.040 −4 −4 10 η = 0.100 10 η = 0.080 η = 0.141 η = 0.113 η = 0.200 η = 0.160 η = 0.283 η = 0.226 −5 η = 0.400 −5 η = 0.320 10 10 η = 0.800 η = 0.640 η = 1.600 η = 1.280

−6 −6 10 10 0 1 2 3 4 5 0 1 2 3 4 5 10 10 10 10 10 10 10 10 10 10 10 10

l l

a Gibbs up date b Dynamical A up date

0 10

−1 10

−2 10

−3 10 η = 0.001 η = 0.003 η = 0.005 −4 η

Rejection rate 10 = 0.010 η = 0.020 η = 0.040 η = 0.080 −5 10 η = 0.160 η = 0.320 η = 0.640 η −6 = 1.280 10 η = 2.560

−7 10 0 1 2 3 4 5 10 10 10 10 10 10

l

c Dynamical B up date

Figure Rejection rates versus tra jectory length l for hidden units


small range determines the rejection rate. In the second way, the leapfrog simulation becomes unstable at some point during the trajectory, due to the heuristics yielding stepsizes inappropriate for that region of state space. The effect of this is catastrophic, because the simulation is almost never able to recover: the value of H either diverges or moves to a much higher value, resulting in near-certain rejection. H has a much greater tendency to move to a higher rather than a lower value, because stepsizes that are inappropriately large may still be stable in a wider orbit.
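This distinction between small stable fluctuations of H and catastrophic divergence can be reproduced on a toy problem. The following Python sketch is an illustration only, not part of the thesis's experiments: it runs the leapfrog discretization on the Hamiltonian H = q²/2 + p²/2 (for which the stability limit is ε = 2), once with a stable stepsize and once with a stepsize beyond the limit.

```python
def leapfrog(q, p, eps, n_steps, grad_e=lambda q: q):
    """Leapfrog integration of H(q, p) = q^2/2 + p^2/2 (unit mass)."""
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_e(q)   # half-step for momentum
        q += eps * p                 # full step for position
        p -= 0.5 * eps * grad_e(q)   # half-step for momentum
    return q, p

def hamiltonian(q, p):
    return 0.5 * q * q + 0.5 * p * p

q0, p0 = 1.0, 0.0
h0 = hamiltonian(q0, p0)

# Stable stepsize: H stays in a narrow band around its initial value,
# even over very many steps.
q, p = leapfrog(q0, p0, eps=0.1, n_steps=10_000)
stable_err = abs(hamiltonian(q, p) - h0)

# Unstable stepsize (eps > 2 for this Hamiltonian): H blows up, giving
# near-certain rejection of the proposal.
q, p = leapfrog(q0, p0, eps=2.1, n_steps=100)
unstable_err = abs(hamiltonian(q, p) - h0)

print(stable_err, unstable_err)
```

Note that the stable run deviates only slightly from the initial H even after ten thousand steps, while the unstable run diverges within a hundred.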

To illustrate this instability, H is plotted in Fig. for trajectories started from the same state but with different randomly-selected initial momenta. We see that an instability can occur at any time, and once it occurs, recovery is virtually impossible, leading to near-certain rejection at the end of the trajectory. The cumulative effect of risking a grossly wrong stepsize with each step is that, for very long trajectories, the probability that we manage to get to the end without once having experienced a catastrophic instability is tiny. Therefore the rejection rate should increase with trajectory length. This explains the observed increase of the rejection rate with l for Dynamical B.

Another view of the instability of H is shown in Fig. , which shows how the distribution of H broadens with increasing number of leapfrog steps.
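The cumulative effect described above can be captured by a simple model: if each leapfrog step independently triggers a catastrophic instability with some small probability, the chance of surviving an l-step trajectory decays geometrically in l. The sketch below uses purely illustrative numbers (a hypothetical per-step instability probability and baseline rejection rate, not values measured in the thesis).

```python
p_cat = 1e-4   # hypothetical per-step probability of catastrophic instability
r0 = 0.05      # hypothetical baseline rejection rate for stable trajectories

def rejection_rate(l):
    """Rejection rate for an l-step trajectory under the simple model."""
    survive = (1.0 - p_cat) ** l          # no instability in any of l steps
    return 1.0 - survive * (1.0 - r0)

for l in (10, 100, 1000, 10_000, 100_000):
    print(l, rejection_rate(l))
```

Under this model the rejection rate stays near the baseline for short trajectories but climbs towards one as l grows, matching the qualitative behaviour observed for the Dynamical B method.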

The old method, which updates the hyperparameters by Gibbs sampling, is not prone to this cumulative rejection effect, as its stepsizes are not being constantly recalculated during a trajectory. Even if its stepsize heuristic gives inappropriate stepsizes with too high a probability, the stepsizes are computed only once at the beginning of the trajectory, and a good stepsize will tend to lead to a stable trajectory no matter how long it is. Such long trajectories then become unstable only by entering a region for which their stepsizes are inappropriate. This effect leads to rejection rates that increase with l, as the longer the trajectory, the more likely such regions are to be encountered. However, the fact that rejection rates for the old method show very little dependency on l (see Fig. ) indicates that entry into such regions happens very rarely. Thus, for the old method, most rejections


[Figure : H plotted over trajectories started from the same state but with different initial momenta. Each trajectory has leapfrog steps. As each trajectory progresses it may become unstable; if it does, H usually rises catastrophically. A transition to infinite H is shown here as a transition to . η was set at , and a network of hidden units was used.]

are due to the normal deviation of H away from its initial value.

That the repeated stepsize calculations handicap the Dynamical B method with this cumulative rejection effect might be cause for pessimism. However, the fact that the Dynamical A method achieves fairly flat rejection rates (Fig. ) shows that the rise in rejection rates is not an inescapable cost of calculating the stepsizes before each step; rather, with appropriate stepsize heuristics, it might be possible to achieve flat rejection rates even in the Dynamical B method.

The stepsize heuristics used in the Dynamical B method for the parameters are the same as those used in the old method, so they are unlikely to be the cause of the catastrophically wrong stepsizes. We expect that it is the stepsize heuristics for the hyperparameters that are at fault. The next section seeks to confirm this.


[Figure : The distribution of H broadens as a trajectory progresses. Panels: (a) before any leapfrog steps, and (b), (c), (d) after successively more leapfrog steps; each panel is a histogram of H. Infinity is binned at in these histograms. Trajectories, all started at the same state but with different initial momenta, were used to generate these histograms. η was set at . A network of hidden units was used.]


The Appropriateness of Stepsize Heuristics

If a stepsize is inappropriately large and leads to instability, then perhaps a smaller stepsize, half as large, might not. If, for example, the parameter stepsizes are sometimes inappropriate, then under a leapfrog discretization where the hyperparameter update remains the same but each parameter update is split into two consecutive updates, each with half the stepsize adjustment factor, the acceptance probability should increase, since the parameter update is now closer to the true Hamiltonian dynamics. On the other hand, if the parameter stepsizes are usually appropriate and it is the hyperparameter stepsizes that are at fault, then the rejection rate should not change much.

As shown in Fig. , when the parameter updates were split, the rejection rates did not change, while they dropped when the hyperparameter updates were split. This indicates that the hyperparameter stepsizes calculated according to the current heuristics are often inappropriate. Fig. was obtained by averaging the rejection rates over a small number of Markov chains run at various settings of l, with η_h = η_p = . A network with hidden units was used, along with the training data from the last chapter. The Markov chain starting states were chosen from the ones used in the tests in the last chapter, with momenta randomly initialized from the unit normal distribution.

To remedy the inappropriateness of the hyperparameter stepsizes, it is possible that the stepsize heuristics for the hyperparameters need to be changed. On the other hand, it is also possible that some setting of η_h/η_p other than 1 is all that is necessary. In the next two sections we explore these two possibilities.

Different Settings for η_h/η_p

In this section we report the results of some experiments to test the possibility that some setting of η_h/η_p can flatten the rejection rate versus l curves, without us having to change the stepsize heuristics.


[Figure : Effect on rejection rates of splitting either the parameter updates or the hyperparameter updates into two, while using the Dynamical B method; the plot shows rejection rate against trajectory length l for the split-parameter, split-hyperparameter, and regular runs. Here η_h = η_p = . This plot was obtained using a network with hidden units.]

We used a network with hidden units and the training data from the last chapter. Markov chains were run at various trajectory lengths l, with η_p = and , and η_h/η_p = . The Markov chain starting states were taken from the states used for the tests in the last chapter, and the momenta were randomly initialized from a unit normal distribution. The resulting rejection rates, averaged over the chains, are plotted in Fig. . Compared to the case where η_h = η_p, the rejection rates do not rise as fast as the trajectory length increases.

This suggests that, by decreasing the ratio η_h/η_p, it may be possible to gain enough of the advantage of having long trajectory lengths to offset the smaller distances travelled in each step due to the smaller η_h.

The geometric mean performance measure g was also calculated at each setting of l. It was found that the best value of g is , which occurs at l = with η_p = . This value of g is considerably worse than the optimal one found for the Dynamical B


[Figure : Rejection rates for hidden units when different ratios of η_h/η_p are used; each panel plots rejection rate against trajectory length l, for η_p = 0.320 and η_p = 0.640. The rejection rates do not rise as fast with increasing trajectory length when η_h is set to be smaller than η_p. The first set of rejection rates were averaged over Markov chains, while the second were obtained using .]

method at this number of hidden units. So we see that even though we are able to take much longer trajectories with smaller η_h/η_p, the smallness of η_h may erase our advantage.

It is possible that some other setting of η_p, or some other ratio of η_h/η_p, does better than the settings tried here, but this question will not be explored further in this thesis due to time constraints. The key conclusion of this section is that smaller ratios of η_h/η_p can flatten the rejection curve, and may be more advantageous than simply setting η_h = η_p.

Fine Splitting of Hyperparameter Updates

Apart from setting a low ratio for η_h/η_p, we conjecture that with sufficiently good heuristics we should also be able to get the rejection rate to stay flat as l increases. To test this, we split the hyperparameter updates into fine updates, many more than the two before. The reason for doing this is that the more stable trajectory obtained by the splitting may roughly model what a good heuristic gives.
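The intuition that finer substeps track the true dynamics more closely can be checked on a toy Hamiltonian. The sketch below is an illustration only, not the thesis's stepsize heuristics: it covers a fixed time interval with k leapfrog substeps of size ε/k on H = q²/2 + p²/2, and measures the resulting error in H.

```python
def leapfrog_interval(q, p, eps, n):
    """Integrate H = q^2/2 + p^2/2 with n leapfrog steps of size eps."""
    for _ in range(n):
        p -= 0.5 * eps * q
        q += eps * p
        p -= 0.5 * eps * q
    return q, p

def energy_error(k, total_time=1.0):
    """Error in H after covering total_time with k substeps of size total_time/k."""
    q, p = 1.0, 0.0                      # initial state, so H0 = 0.5
    q, p = leapfrog_interval(q, p, total_time / k, k)
    return abs((0.5 * q * q + 0.5 * p * p) - 0.5)

print(energy_error(1), energy_error(2), energy_error(10))
```

The error in H shrinks as the number of substeps grows, which is the mechanism by which the finely split hyperparameter updates stabilize the trajectory.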


[Figure : Effect on rejection rates of splitting the hyperparameter updates into updates, while using the Dynamical B method with η_h = η_p = . Compared to the normal unsplit updates (shown for η = 0.080, 0.160, 0.200, and 0.320), rejection rates are now much lower. The network used here has hidden units.]

The split hyperparameter updates were tried on a network with hidden units, using the same training data as in the last chapter, and using Markov chain starting states taken also from the tests conducted in the previous chapter. Momenta were randomly initialized from a unit normal distribution. The Markov chains were run at η_h = η_p = with various values of l. Fig. shows that the resulting rejection rates for runs with split hyperparameter updates still increase with l, but the rate of increase is much gentler now, increasing about one order of magnitude from about . This is much better than the rejection rates obtained from the normal unsplit updates, which we contrast in the same figure. The unsplit updates do worse even for η less than half the size.

This suggests that, with sufficiently good heuristics, trajectories might be able to go far enough to truly reap the advantages of updating the hyperparameters dynamically.


Why the Stepsize Heuristics are Bad

A possible explanation for why the hyperparameter stepsize heuristics do not work well is that we cannot use the current values of the hyperparameters to compute their stepsizes. As we are unable to get an estimate of them from the weights, due to their reparameterization, we are forced to use their prior means, which are not necessarily very good estimates.

In retrospect, this should have been obvious. The backpropagation of the second derivative of the likelihoods multiplies together the hyperparameter variances of each layer of weights that it propagates derivatives through. For instance, the second derivative of the likelihood with respect to the input-to-hidden weight hyperparameter σ_u is proportional to the product of the prior variances of the hidden-to-output weights (σ_v) and the input-to-hidden weights (σ_u). In an -hidden-unit run, our settings were such that . Yet from Fig. it is clear that, as the Markov chain ranges over the posterior distribution, the product of these two variances can actually range up to e or more. Clearly, the prior value is a bad estimate of this product. This can lead to stepsizes which are e times larger than what they would have been if we had used the actual values of the hyperparameters.

The Dynamical A method does not suffer from this multiplicative effect of wrong hyperparameter estimates, as its neural network function does not depend on the hyperparameters, so there is no need to do backpropagation of second derivatives. Indeed, the second derivative of the potential energy in Eqn. is proportional to just the precision of the hyperparameter, so if our estimate of the hyperparameter is k times too small, the stepsize is only going to go up by a factor of √k. Furthermore, the parameters contain information about the value of the hyperparameter: the Dynamical A method estimates a hyperparameter as its posterior mean given its weights.
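The √k scaling can be made concrete with a stepsize heuristic of the generic form η ∝ (second derivative)^(−1/2). The precision value and error factor below are hypothetical, chosen only to illustrate the scaling.

```python
def stepsize_from_curvature(curvature):
    """Generic heuristic: stepsize proportional to curvature^(-1/2)."""
    return curvature ** -0.5

tau = 4.0    # hypothetical true hyperparameter precision
k = 9.0      # suppose our estimate of the precision is k times too small

eta_true = stepsize_from_curvature(tau)
eta_est = stepsize_from_curvature(tau / k)

# Underestimating the precision by a factor of k inflates the stepsize
# by only sqrt(k) -- here, a factor of 3.
print(eta_est / eta_true)
```

This mild √k sensitivity is why the Dynamical A heuristics tolerate poor hyperparameter estimates far better than the multiplicative Dynamical B heuristics do.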


Other Implications of the Current Heuristics

The fact that the current heuristics use the prior means of the hyperparameters as estimates of the hyperparameters themselves during stepsize calculations has the implication that, when the Dynamical B method is used, the prior means must be carefully selected to be close to values where the hyperparameters have high posterior probability. This allows the stepsizes to be accurate when moving about in areas of high posterior probability. If the prior means are badly set, exploration of the posterior is expected to become very inefficient.

A hybrid Monte Carlo method that computes bad stepsizes in some region of state space may usually be unable to enter that region, because it rejects once a trajectory enters it. Furthermore, once having entered that region, it is obliged to stay there a long time, through its high rejection rate in that region, in order to compensate for its inability to enter that region in the first place. This leads to high autocorrelations.

We might ask if the B method actually suffers from this problem. If it does, the effect is small, as nothing of the sort was noticed: the marginal posterior distributions of the hyperparameters obtained by the B method match those from the old method (Fig. ). However, this is no guarantee that it will work similarly well for other problems.

Conclusion

In this thesis we have introduced a new way of learning the hyperparameters in a neural network model using Hamiltonian dynamics. We have presented two versions of the new method: the Dynamical A method, and the Dynamical B method, which is the former with the weights reparameterized to enhance movement between large and small values of the hyperparameters. We have also developed performance evaluation methodologies that measure the rate of exploration of the posterior while accounting for the different compute times required by the different methods. The observation that some parts of


state space might have a large effect on the results then led us to a stratified version of the performance measures. After extensive testing, we have found that these same regions of state space can cause Dynamical A to become very inefficient, for the reasons that led us to formulate Dynamical B. However, we have also shown that Dynamical B does not show a measurable performance improvement over the old method.

The Dynamical A method as it currently stands suffers from expected inefficiencies, but it was hoped that Dynamical B would overcome them and yield better performance. Instead, there is currently no reason to recommend either new method over the old one.

We strongly believe that the Dynamical B method's Achilles heel is the fact that its rejection rate increases with trajectory length. If longer trajectories can be achieved while keeping rejection rates low, it is expected that the Dynamical B method can become significantly faster. To achieve this, future work might focus on an improved parameterization of the weights and/or more accurate hyperparameter stepsize heuristics. Also, the stepsize heuristics can potentially be simplified, to allow more leapfrog steps to be taken.

Ultimately, our efforts are being hampered by the fact that the volume of the state space under the posterior having large hyperparameter variance is huge and of low density, while the region having small hyperparameter variance is small and of very high density, and hybrid Monte Carlo does not move well between these two types of regions. Doing nothing about this leads to the Dynamical A method, which we have shown can have severe inefficiencies in certain regions. On the other hand, our effort to reparameterize the weights to tackle this problem leads to the hyperparameters being confounded with the weights in the computation of the network output, and that is the cause of our inability to have long trajectories while keeping rejection rates down. It may be that we are pushing against the inherent limitations of the hybrid Monte Carlo method here, but we nevertheless hope that further work will overcome the present difficulties.

Appendix A

Preservation of Phase Space Volume Under Hamiltonian Dynamics

Here we show the well-known result that Hamiltonian dynamics keeps the Hamiltonian H, as well as phase space volume, constant.

Hamiltonian dynamics is characterized by

    \dot{q} = \frac{\partial H}{\partial p}, \qquad \dot{p} = -\frac{\partial H}{\partial q},    (A.1)

where q and p are the state and the momentum variables respectively.

The following shows that Hamiltonian dynamics keeps H constant:

    \frac{dH}{dt} = \frac{\partial H}{\partial q}\dot{q} + \frac{\partial H}{\partial p}\dot{p}
                  = \frac{\partial H}{\partial q}\frac{\partial H}{\partial p} - \frac{\partial H}{\partial p}\frac{\partial H}{\partial q} = 0.    (A.2)

To show that Hamiltonian dynamics conserves phase space volume, consider the phase space flow (\dot{q}, \dot{p}), which defines a vector field in the phase space (q, p). The fact that phase space volume is conserved is due to the fact that the divergence of this vector field is zero:

    \nabla \cdot (\dot{q}, \dot{p}) = \frac{\partial \dot{q}}{\partial q} + \frac{\partial \dot{p}}{\partial p}
                                    = \frac{\partial^2 H}{\partial q \, \partial p} - \frac{\partial^2 H}{\partial p \, \partial q} = 0.    (A.3)

Appendix B

Proof of Theorem : Deterministic Proposals for the Metropolis Algorithm

Here we prove Theorem from Section .

First, we need the following definition.

Definition (Detailed Balance): We say that the transition probabilities T(x, A), defined for all points x and all sets A, satisfy detailed balance with respect to the density \pi(x) if, given any two sets A and B,

    \int_A \pi(x) \, T(x, B) \, dx = \int_B \pi(x) \, T(x, A) \, dx.    (B.1)

In words, the detailed balance condition says that, in equilibrium, the probability of starting in A and moving to B in one transition is exactly equal to the probability of starting in B and moving to A.

Recall that to say that a Markov update leaves a distribution \pi(x) invariant is to say that the total probability mass \pi(A) in some arbitrary set A is unchanged by the Markov transition. That is,

    \int_R \pi(x) \, T(x, A) \, dx = \pi(A),    (B.2)

where R is the state space.

[Figure B.: A maps to A' under M, and B maps to B' under M.]

It is easy to see that if a Markov transition satisfies detailed balance with respect to \pi(x), then it leaves \pi(x) invariant: we need only set B to R to see this. Thus we only need to prove that the Metropolis algorithm, with deterministic proposals satisfying the two conditions in Theorem , yields Markov updates that satisfy detailed balance with respect to the desired distribution \pi(x). That is our aim in the discussion below.

Consider the deterministic mapping M : R \to R. Let A \subset R and B \subset R, and write A' = M(A) and B' = M(B). Under M, the image of A might in general have some part outside B, and the image of B might have some part outside A, as shown in Fig. B.

Assume that the mapping M is the inverse of itself, so that M(M(x)) = x. We must then have that M(A \cap B') = B \cap A'. Suppose not. Then there is a point x \in A \cap B' whose image lies outside B \cap A', which we show as "M(x)" in Fig. B. M(x) must be in A', as x \in A. But the fact that x is in B' means that it is the image under M of some point y in B, so that M(y) = x. If indeed x maps onto the point "M(x)", then M(M(y)) \ne y, for y is in B but "M(x)" is not. Since this violates the assumption that M is its own

[Figure B.: x \in A \cap B' must map into B \cap A'.]

inverse, we conclude that M(A \cap B') = B \cap A'. Note that M(B \cap A') = A \cap B' also, due to the self-inverting assumption on M.

With this fact in hand, we are ready to show how to achieve detailed balance for the Metropolis algorithm. We assume self-inverting deterministic proposals. Since the total probability mass flowing from A to B takes place in the flow from A \cap B' to B, we can rewrite the left hand side of Eqn. (B.1):

    \int_A \pi(x) \, T(x, B) \, dx = \int_{A \cap B'} \pi(x) \, T(x, B) \, dx
                                   = \int_{A \cap B'} \pi(x) \min\left(1, \frac{\pi(M(x))}{\pi(x)}\right) dx    (B.3)
                                   = \int_{A \cap B'} \min(\pi(x), \pi(M(x))) \, dx,

where we have used the Metropolis transition probability T(x, B) = \min(1, \pi(M(x))/\pi(x)) for x \in A \cap B'.

We can then rewrite the right hand side of Eqn. (B.1):

    \int_B \pi(y) \, T(y, A) \, dy = \int_{B \cap A'} \pi(y) \, T(y, A) \, dy
                                   = \int_{B \cap A'} \pi(y) \min\left(1, \frac{\pi(M(y))}{\pi(y)}\right) dy
                                   = \int_{M(A \cap B')} \pi(y) \min\left(1, \frac{\pi(M(y))}{\pi(y)}\right) dy
                                   = \int_{M(A \cap B')} \min(\pi(y), \pi(M(y))) \, dy
                                   = \int_{A \cap B'} \min(\pi(M(x)), \pi(M(M(x)))) \left|\frac{\partial y}{\partial x}\right| dx  (using y = M(x))
                                   = \int_{A \cap B'} \min(\pi(M(x)), \pi(x)) \left|\frac{\partial y}{\partial x}\right| dx.    (B.4)

Comparing the expressions resulting from manipulating the left and the right-hand sides of the detailed balance condition, we see that they are equal if the Jacobian |\partial y / \partial x| = 1. Thus detailed balance with respect to \pi(x) is achieved if the Metropolis algorithm is used with deterministic proposals that are reversible (self-inverting) and that have Jacobian 1, and we are done.
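The pointwise identity underlying this proof, namely that \pi(x) times the acceptance probability at x equals \pi(M(x)) times the acceptance probability at M(x) for a self-inverting, unit-Jacobian proposal, can be checked numerically. The sketch below is an illustration only, using an arbitrary Gaussian target and a reflection proposal (both hypothetical choices, not from the thesis).

```python
import math

def pi(x):
    """Unnormalized target density: a standard Gaussian."""
    return math.exp(-0.5 * x * x)

def M(x):
    """A self-inverting proposal with unit Jacobian: reflection about c."""
    c = 1.7   # arbitrary illustrative reflection point
    return c - x

def accept_prob(x):
    """Metropolis acceptance probability for the deterministic proposal M."""
    return min(1.0, pi(M(x)) / pi(x))

for x in (-2.0, -0.3, 0.0, 0.9, 3.5):
    assert abs(M(M(x)) - x) < 1e-12               # M is its own inverse
    flow_xy = pi(x) * accept_prob(x)              # pi(x) T(x -> M(x))
    flow_yx = pi(M(x)) * accept_prob(M(x))        # pi(M(x)) T(M(x) -> x)
    assert abs(flow_xy - flow_yx) < 1e-12
print("detailed balance identity holds pointwise")
```

Both flows reduce to min(\pi(x), \pi(M(x))), which is symmetric in its arguments; this is the same cancellation that equates (B.3) and (B.4).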

Appendix C

Preservation of Phase Space Volume Under Leapfrog Updates

In this Appendix we show that leapfrog updates preserve phase space volume.

Consider the simultaneous update of two variables, such that each update does not depend on the variable it updates, but depends on the value of the other variable:

    x' = x + f(y)
    y' = y + g(x)    (C.1)

It can easily be shown that the Jacobian of such an update is 1 - f'(y) g'(x), so the update does not preserve phase space volume in general. On the other hand, consider two sequential updates, where each update also depends only on the variable updated by the other, but the updates are performed one after another:

    x' = x + f(y)
    y' = y + g(x') = y + g(x + f(y))    (C.2)

The Jacobian of such an update can be shown to be identically equal to 1, and so it preserves phase space volume. This is simply a demonstration of the fact that each sequential update of a variable that does not depend on itself amounts to a shear in the direction of that variable. Thus each sequential update preserves phase space volume, and indeed a chain of such updates does as well.
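The two Jacobians claimed above can be verified numerically for specific choices of f and g (the functions below are arbitrary illustrations). This finite-difference sketch compares the simultaneous and sequential update maps:

```python
import math

def f(y):
    return math.sin(y)   # arbitrary illustrative choice of f

def g(x):
    return x * x         # arbitrary illustrative choice of g

def simultaneous(x, y):
    # Both updates use the old values of (x, y).
    return x + f(y), y + g(x)

def sequential(x, y):
    # The second update sees the newly updated x.
    x_new = x + f(y)
    return x_new, y + g(x_new)

def jac_det(update, x, y, h=1e-6):
    """Jacobian determinant of a 2-D map, via central differences."""
    dxx = (update(x + h, y)[0] - update(x - h, y)[0]) / (2 * h)
    dyx = (update(x + h, y)[1] - update(x - h, y)[1]) / (2 * h)
    dxy = (update(x, y + h)[0] - update(x, y - h)[0]) / (2 * h)
    dyy = (update(x, y + h)[1] - update(x, y - h)[1]) / (2 * h)
    return dxx * dyy - dxy * dyx

x0, y0 = 0.3, 0.7
print(jac_det(simultaneous, x0, y0))  # 1 - f'(y0) g'(x0), generally != 1
print(jac_det(sequential, x0, y0))    # 1: the shear preserves volume
```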

The leapfrog updates are

    p_i(t + \epsilon/2) = p_i(t) - \frac{\epsilon}{2} \frac{\partial E}{\partial q_i}(q(t))                 for each i = 1, ..., d
    q_i(t + \epsilon) = q_i(t) + \epsilon \, \frac{p_i(t + \epsilon/2)}{m_i}                               for each i = 1, ..., d    (C.3)
    p_i(t + \epsilon) = p_i(t + \epsilon/2) - \frac{\epsilon}{2} \frac{\partial E}{\partial q_i}(q(t + \epsilon))    for each i = 1, ..., d

The update for p appears to be simultaneous, in that each p_i is updated without reference to any newly-updated components of p. However, in the leapfrog update, the update of each p_i does not actually depend on any other component of p, so the update of each p_i can be viewed as being sequential, and hence as conserving phase space volume. The same applies to each q_i.

Bibliography

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press.

W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex Systems.

G. Cybenko. Approximation by superpositions of a sigmoid function. Mathematics of Control, Signals and Systems.

L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag.

S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, September.

B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall.

W. Feller. An Introduction to Probability Theory and Its Applications. John Wiley and Sons, Inc.

D. J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology.

D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation.

P. B. Mackenzie. An improved hybrid Monte Carlo method. Physics Letters B, August.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics.

P. Müller and D. R. Insua. Issues in Bayesian analysis of neural network models. Neural Computation.

R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-, University of Toronto, September.

R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag.

J. S. Rosenthal. A review of asymptotic convergence of general state space Markov chains. March.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature.