Hamiltonian Monte Carlo Without Detailed Balance
Jascha Sohl-Dickstein
Stanford University, Palo Alto. Khan Academy, Mountain View

Mayur Mudigonda
Redwood Institute for Theoretical Neuroscience, University of California at Berkeley

Michael R. DeWeese
Redwood Institute for Theoretical Neuroscience, University of California at Berkeley

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

We present a method for performing Hamiltonian Monte Carlo that largely eliminates sample rejection. In situations that would normally lead to rejection, instead a longer trajectory is computed until a new state is reached that can be accepted. This is achieved using Markov chain transitions that satisfy the fixed point equation, but do not satisfy detailed balance. The resulting algorithm significantly suppresses the random walk behavior and wasted function evaluations that are typically the consequence of update rejection. We demonstrate a greater than factor of two improvement in mixing time on three test problems. We release the source code as Python and MATLAB packages.

1. Introduction

High dimensional and otherwise computationally expensive probabilistic models are of increasing importance for such diverse tasks as modeling the folding of proteins (Schütte & Fischer, 1999), the structure of natural images (Culpepper et al., 2011), or the activity of networks of neurons (Cadieu & Koepsell, 2010).

Sampling from the described distribution is typically the bottleneck when working with these probabilistic models. Sampling is commonly required when training a probabilistic model, when evaluating the model's performance, when performing inference, and when taking expectations (MacKay, 2003). Therefore, work that improves sampling is fundamentally important.

The most common way to guarantee that a sampling algorithm converges to the correct distribution is via a concept known as detailed balance. Sampling algorithms based on detailed balance are powerful because they allow samples from any target distribution to be generated from almost any proposal distribution, using for instance Metropolis-Hastings acceptance criteria (Hastings, 1970). However, detailed balance also suffers from a critical flaw. Precisely because the forward and reverse transitions occur with equal probability, detailed balance driven samplers go backwards exactly as often as they go forwards. The state space is thus explored via a random walk over distances longer than those traversed by a single draw from the proposal distribution. A random walk only travels a distance $d N^{1/2}$ in $N$ steps, where $d$ is the characteristic step length.

The current state-of-the-art sampling algorithm for probability distributions with continuous state spaces is Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010). By extending the state space to include auxiliary momentum variables, and then using Hamiltonian dynamics to traverse long iso-probability contours in this extended state space, HMC is able to move long distances in state space in a single update step. However, HMC still relies on detailed balance to accept or reject steps, and as a result still behaves like a random walk, just a random walk with a longer step length. Previous attempts to address this have combined multiple Markov steps that individually satisfy detailed balance into a composite step that does not (Horowitz, 1991), with limited success (Kennedy & Pendleton, 1991).

The No-U-Turn Sampler (NUTS) sampling package (Hoffman & Gelman, 2011) and the windowed acceptance method of (Neal, 1994) both consider Markov transitions within a set of discrete states generated by repeatedly simulating Hamiltonian dynamics. NUTS generates a set of candidate states around the starting state by running Hamiltonian dynamics forwards and backwards until the trajectory doubles back on itself, or a slice variable constraint is violated. It then chooses a new state at uniform from the candidate states. In windowed acceptance, a transition is proposed between a window of states at the beginning and end of a trajectory, rather than the first state and last state. With the selected window, a single state is then chosen using Boltzmann weightings. Both NUTS and the windowed acceptance method rely on detailed balance to choose the candidate state from the discrete set.

Here we present a novel discrete representation of the HMC state space and transitions. Using this representation, we derive a method for performing HMC while abandoning detailed balance altogether, by directly satisfying the fixed point equation restricted to the discrete state space. As a result, random walk behavior in the sampling algorithm is greatly reduced, and the mixing rate of the sampler is substantially improved.

2. Sampling

We begin by briefly reviewing some key concepts related to sampling. The goal of a sampling algorithm is to draw characteristic samples $x \in \mathbb{R}^N$ from a target probability distribution $p(x)$. Without loss of generality, we will assume that $p(x)$ is determined by an energy function $E(x)$,

$$p(x) = \frac{1}{Z} \exp\left(-E(x)\right). \tag{1}$$

2.1. Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) (Neal, 1993) is commonly used to sample from probabilistic models. In MCMC a chain of samples is generated by repeatedly drawing new samples $x'$ from a conditional probability distribution $T(x'|x)$, where $x$ is the previous sample. Since $T(x'|x)$ is a probability density over $x'$, $\int T(x'|x)\, dx' = 1$ and $T(x'|x) \geq 0$.

2.2. Fixed Point Equation

An MCMC algorithm must satisfy two conditions in order to generate samples from the target distribution $p(x)$. The first is mixing, which requires that repeated application of $T(x'|x)$ must eventually explore the full state space of $p(x)$. The second condition is that the target distribution $p(x)$ must be a fixed point of $T(x'|x)$. This second condition can be expressed by the fixed point equation,

$$\int p(x)\, T(x'|x)\, dx = p(x'), \tag{2}$$

which requires that when $T(x'|x)$ acts on $p(x)$, the resulting distribution is unchanged.

2.3. Detailed Balance

Detailed balance is the most common way of guaranteeing that the Markov transition distribution $T(x'|x)$ satisfies the fixed point equation (Equation 2). Detailed balance guarantees that if samples are drawn from the equilibrium distribution $p(x)$, then for every pair of states $x$ and $x'$ the probability of transitioning from state $x$ to state $x'$ is identical to that of transitioning from state $x'$ to $x$,

$$p(x)\, T(x'|x) = p(x')\, T(x|x'). \tag{3}$$

By substitution for $T(x'|x)$ in the left side of Equation 2, it can be seen that if Equation 3 is satisfied, then the fixed point equation is also satisfied.

An appealing aspect of detailed balance is that a transition distribution satisfying it can be easily constructed from nearly any proposal distribution, using Metropolis-Hastings acceptance/rejection rules (Hastings, 1970). A primary drawback of detailed balance, and of Metropolis-Hastings, is that the resulting Markov chains always engage in random walk behavior, since by definition detailed balance depends on forward and reverse transitions happening with equal probability.

The primary advance in this paper is demonstrating how HMC sampling can be performed without resorting to detailed balance.

3. Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) can traverse long distances in state space with single Markov transitions. It does this by extending the state space to include auxiliary momentum variables, and then simulating Hamiltonian dynamics to move long distances along iso-probability contours in the expanded state space.

3.1. Extended state space

The state space is extended by the addition of momentum variables $v \in \mathbb{R}^N$, with identity-covariance Gaussian distribution,

$$p(v) = (2\pi)^{-\frac{N}{2}} \exp\left(-\frac{1}{2} v^T v\right). \tag{4}$$

We refer to the combined state space of $x$ and $v$ as $\zeta$, such that $\zeta = \{x, v\}$. The corresponding joint distribution is

$$p(\zeta) = p(x, v) = p(x)\, p(v) = \frac{(2\pi)^{-\frac{N}{2}}}{Z} \exp\left(-H(\zeta)\right), \tag{5}$$

$$H(\zeta) = H(x, v) = E(x) + \frac{1}{2} v^T v. \tag{6}$$

$H(\zeta)$ has the same form as total energy in a physical system, where $E(x)$ is the potential energy for position $x$ and $\frac{1}{2} v^T v$ is the kinetic energy for momentum $v$ (mass is set to one).

In HMC, samples from $p(x)$ are generated by drawing samples from the joint distribution $p(x, v)$, and using only the $x$ variables as samples from the desired distribution.

3.2. Hamiltonian dynamics

Hamiltonian dynamics govern how physical systems evolve with time. It might be useful to imagine the trajectory of a skateboarder rolling in an empty swimming pool. As she rolls downwards she exchanges potential energy for kinetic energy, and the magnitude of her velocity increases. As she rolls up again she exchanges kinetic energy back for potential energy. In this fashion she is able to traverse long distances across the swimming pool, while at the same time maintaining constant total energy over her entire trajectory.

[Figure 1, panels (a) and (b). Labels in the figure include $F\zeta$, $L\zeta$, $L^{-1}\zeta$, $LF\zeta$, and $R(\beta)\zeta$.]

Figure 1. (a) The action of operators involved in Hamiltonian Monte Carlo (HMC). The base of each red or green arrow represents the position $x$, and the length and direction of each of these arrows represents the momentum $v$. The flip operator $F$ reverses

In HMC, we treat $H(\zeta)$ as the total energy of a physical system, with spatial coordinate $x$, velocity $v$, potential energy $E(x)$, and kinetic energy $\frac{1}{2} v^T v$.
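
The energy picture above can be illustrated numerically. The following is a minimal sketch, not the authors' released code: it assumes a hypothetical quadratic energy function (a standard Gaussian target) and uses the standard leapfrog integrator to approximate Hamiltonian dynamics. The function names, step size, and number of steps are illustrative choices. It checks that the total energy $H(x, v) = E(x) + \frac{1}{2} v^T v$ stays nearly constant along a simulated trajectory, and that flipping the momentum and integrating again retraces the path.

```python
import numpy as np

def energy(x):
    """Hypothetical potential energy E(x): a standard Gaussian target."""
    return 0.5 * np.dot(x, x)

def grad_energy(x):
    """Gradient of the hypothetical E(x) defined above."""
    return x

def hamiltonian(x, v):
    """Total energy H(x, v) = E(x) + 0.5 * v^T v, with unit mass."""
    return energy(x) + 0.5 * np.dot(v, v)

def leapfrog(x, v, step_size, n_steps):
    """Approximately integrate Hamiltonian dynamics with leapfrog steps."""
    x, v = x.copy(), v.copy()
    for _ in range(n_steps):
        v = v - 0.5 * step_size * grad_energy(x)  # half step in momentum
        x = x + step_size * v                     # full step in position
        v = v - 0.5 * step_size * grad_energy(x)  # half step in momentum
    return x, v

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)   # arbitrary starting position (assumption)
v0 = rng.standard_normal(5)   # momentum drawn from an identity-covariance Gaussian

x1, v1 = leapfrog(x0, v0, step_size=0.05, n_steps=100)

# Total energy changes only slightly even though the position moves far.
energy_drift = abs(hamiltonian(x1, v1) - hamiltonian(x0, v0))
distance_moved = np.linalg.norm(x1 - x0)

# Reversing the momentum and integrating again returns to the start.
x_back, v_back = leapfrog(x1, -v1, step_size=0.05, n_steps=100)
```

With exact dynamics the energy drift would be zero; the small residual here comes from the discretization of the integrator.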