
A Gradient Based Strategy for Hamiltonian Monte Carlo Hyperparameter Optimization

Andrew Campbell*¹  Wenlong Chen*²  Vincent Stimper*³⁴  José Miguel Hernández-Lobato³  Yichuan Zhang⁵

*Equal contribution. ¹Department of Statistics, University of Oxford. ²Baidu, Inc. ³Department of Engineering, University of Cambridge. ⁴Max Planck Institute for Intelligent Systems. ⁵Boltzbit Ltd. Correspondence to: Andrew Campbell.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Hamiltonian Monte Carlo (HMC) is one of the most successful sampling methods in machine learning. However, its performance is significantly affected by the choice of hyperparameter values. Existing approaches for optimizing the HMC hyperparameters either optimize a proxy for mixing speed or consider the HMC chain as an implicit variational distribution and optimize a tractable lower bound that can be very loose in practice. Instead, we propose to optimize an objective that quantifies directly the speed of convergence to the target distribution. Our objective can be easily optimized using stochastic gradient descent. We evaluate our proposed method and compare to baselines on a variety of problems including sampling from synthetic 2D distributions, reconstructing sparse signals, learning deep latent variable models and sampling molecular configurations from the Boltzmann distribution of a 22 atom molecule. We find that our method is competitive with or improves upon alternative baselines in all these experiments.

1. Introduction

Hamiltonian Monte Carlo (HMC) is a very popular Markov Chain Monte Carlo (MCMC) method for generating approximate samples from complex probability distributions. It finds wide use in machine learning (Neal, 2011) and across the broader statistical community (Carpenter et al., 2017). Unfortunately, HMC's performance depends heavily on the choice of hyperparameters such as the proposal step size. A step size that is too large can lead to unstable dynamics, while a step size that is too small may result in random walk behaviour and highly correlated samples. Furthermore, the step sizes may need to be tuned on a dimension by dimension basis depending on the scale and shape of the target distribution (Neal, 2011).

Traditionally, samples are produced by running a single HMC chain for a long time, compensating for imperfect hyperparameter choices through ergodicity. However, it is becoming increasingly attractive to run many short MCMC chains in parallel to make better use of parallel compute hardware (Hoffman & Ma, 2020). In this case, it is even more important to choose good hyperparameters that encourage fast mixing. This approach also offers the novel opportunity to choose different hyperparameter values for each step in the chain, providing more tuning flexibility.

Fully exploiting this opportunity in practice is a challenge. MCMC hyperparameters are commonly tuned according to the average acceptance probability, but it is unclear how this can scale up to tuning every hyperparameter individually. Looking to backpropagation's success in training neural networks, a gradient based method would seem most appropriate. Unfortunately, there is no universal tractable metric to quantify the performance of HMC that we can optimize for. We must therefore make a choice about which approximate metric is best suited for this application.

One approach is to use a proxy for the mixing speed of the chain. Levy et al. (2018) make use of a variation of the expected squared jumped distance (Pasarica & Gelman, 2010) to encourage proposals to make large moves in space. Alternatively, one can draw upon ideas from Variational Inference (VI) (Jordan et al., 1999), which matches an approximate distribution q to the target distribution p by maximizing the Evidence Lower Bound, or ELBO. This is equivalent to minimizing the KL-divergence between q and p. Salimans et al. (2015) and Wolf et al. (2016) use VI to obtain a training objective in which q is the joint distribution of all HMC samples along the chain. For tractability, they introduce an auxiliary inference distribution approximating the reverse dynamics of the chain. The looseness of their ELBO then depends on the KL-divergence between the auxiliary inference distribution and the true reverse dynamics. As the chain length increases so does the dimensionality of these distributions, resulting in a looser and looser bound. This is problematic as, for longer chains, the optimized hyperparameters are encouraged to fit the imperfect auxiliary distribution as opposed to the target. In practice, Salimans et al. avoid this problem by only considering very short HMC chains, which limits the flexibility of their method.

We overcome these issues by considering the marginal distribution of the final state in the chain as our variational q. In this case, the ELBO can be broken down into the sum of the tractable expectation with respect to q of the log target density (up to a normalization constant) and the intractable entropy of q. During optimization, the entropy term prevents a fully flexible q from collapsing to a point mass maximizing the log target density. However, a HMC chain, by construction, cannot collapse to such a point mass. We argue that optimization can still be successful whilst ignoring the entropy term, provided the initial distribution of the chain is broad enough. In practice, we achieve this by inflating the initial proposal distribution by a scaling that is independently tuned by minimizing a discrepancy measure between p and q (Gong et al., 2021).

We empirically compare our method with alternative baselines on a wide range of tasks. We first consider sampling from a collection of 2D toy distributions. We then focus on more challenging approximate inference problems: reconstructing sparse signals, training deep latent variable models on MNIST and FashionMNIST and, finally, sampling molecular configurations from the Boltzmann distribution of the 22 atom molecule Alanine Dipeptide. Our results show that our method is competitive with or can improve upon alternative tuning methods for HMC on all these problems.

2. Background

2.1. Hamiltonian Monte Carlo

HMC (Neal, 1993) aims to draw samples from an n-dimensional target distribution p(x) = (1/Z) p*(x), where Z is the (usually unknown) normalization constant. It introduces an auxiliary variable ν ∈ R^n, referred to as the momentum, which is distributed according to N(ν; 0, diag(m)), with the resulting method sampling on the extended space (x, ν). HMC progresses by first sampling an initial state from some initial distribution and then iteratively proposing new states and accepting/rejecting them according to an acceptance probability. To propose a new state, first, a new value for the momentum is drawn from N(ν; 0, diag(m)); then, we simulate Hamiltonian Dynamics with Hamiltonian H(x, ν) = −log p*(x) + (1/2) νᵀ diag(m)⁻¹ ν, arriving at a new state (x′, ν′). This new state is accepted with probability min(1, exp(−H(x′, ν′) + H(x, ν))). Otherwise we reject the proposed state and remain at the starting state. The Hamiltonian Dynamics are simulated using a numerical integrator, with leapfrog (Hairer et al., 2003) being a popular choice. L leapfrog updates are taken to propose a new state, with the update equations at step k being

    ν_{k+1/2} = ν_k + (ε/2) ∘ ∇_{x_k} log p*(x_k),
    x_{k+1} = x_k + ν_{k+1/2} ∘ ε ∘ (1/m),
    ν_{k+1} = ν_{k+1/2} + (ε/2) ∘ ∇_{x_{k+1}} log p*(x_{k+1}),

where 1/m = (1/m_1, ..., 1/m_n) and ∘ denotes element-wise multiplication. The step size, ε, and the mass, m, are hyperparameters that need to be tuned for each problem the method is applied to. We note that in the usual definition of HMC, a single scalar valued ε is used. Our use of a vector ε implies a different step size in each dimension which, with proper tuning, can improve performance by accounting for different scales across dimensions. The use of a vector ε does mean the procedure can no longer be interpreted as simulating Hamiltonian Dynamics; however, it can still be used as a valid HMC proposal (Neal, 2011). Further, ε and m both correspond to an element-wise rescaling of x and so tuning both does not increase the expressivity of the method. However, we found empirically that this overparameterization aided optimization. We do not consider the problem of choosing L in this work.
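To make the update rule concrete, the sketch below implements a single batched HMC accept/reject step with a per-dimension step size ε and mass vector m. This is our illustration rather than the authors' released code; log_p_star is assumed to return log p*(x) for a batch of states of shape (N, n).

```python
import torch

def grad_log_p(x, log_p_star):
    # Score of the target; x is detached here, matching the
    # stop-gradient trick discussed later in Section 3.2.
    x = x.detach().requires_grad_(True)
    return torch.autograd.grad(log_p_star(x).sum(), x)[0]

def leapfrog(x, v, log_p_star, eps, inv_m, L):
    # Half step for the momentum, then L alternating full steps,
    # finishing with a final momentum half step.
    v = v + 0.5 * eps * grad_log_p(x, log_p_star)
    for i in range(L):
        x = x + v * eps * inv_m
        coeff = 1.0 if i < L - 1 else 0.5
        v = v + coeff * eps * grad_log_p(x, log_p_star)
    return x, v

def hmc_step(x, log_p_star, eps, m, L):
    v = torch.randn_like(x) * m.sqrt()          # v ~ N(0, diag(m))
    def H(x_, v_):
        return -log_p_star(x_) + 0.5 * (v_ * v_ / m).sum(-1)
    x_new, v_new = leapfrog(x, v, log_p_star, eps, 1.0 / m, L)
    log_accept = H(x, v) - H(x_new, v_new)      # min(1, exp(.)) via comparison
    accept = torch.rand_like(log_accept) < log_accept.exp()
    return torch.where(accept.unsqueeze(-1), x_new, x)
```

Because eps and m enter the leapfrog arithmetic directly, samples produced by hmc_step remain differentiable with respect to these hyperparameters, which is what the optimization strategy of Section 3 relies on.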

2.2. Variational Inference

VI approximates the target p(x) with a tractable distribution q_φ(x) parameterized by φ. The value of φ is chosen so as to minimise the Kullback-Leibler divergence with the target, D_KL(q_φ(x) || p(x)). As, typically, we know p(x) only up to a normalization constant, we can equivalently choose φ by maximising the tractable ELBO:

    log Z − D_KL(q_φ(x) || p(x)) = E_{q_φ(x)}[log p*(x) − log q_φ(x)].
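For reference, a minimal Monte Carlo estimate of the right-hand side, assuming q is a torch.distributions object whose rsample keeps the gradient path (the reparameterization trick), so the estimate is differentiable in φ:

```python
import torch

def elbo_estimate(log_p_star, q, n_samples=1000):
    # E_q[log p*(x) - log q(x)] by Monte Carlo; rsample makes this
    # differentiable w.r.t. q's parameters.
    x = q.rsample((n_samples,))
    return (log_p_star(x) - q.log_prob(x)).mean()
```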

3. Expected Log-Target Maximization

VI tunes the parameters of an approximate distribution to make it closer to the target. We build on this to obtain a tractable objective for HMC hyperparameter optimization. In the parallel HMC setting, we run multiple parallel HMC chains and take the final sample in each chain. Viewing this from the VI perspective, these final samples would be independent samples from an implicit variational distribution. If each chain starts at a sample from an initial distribution q^(0)(x) and then runs T accept/reject cycles, we can denote the resulting implicit distribution as q_φ^(T)(x), where φ now represents the step-by-step hyperparameters φ = {ε^(1:T), m^(1:T)}. Ideally, φ would be chosen so as to maximise the ELBO:

    φ* = argmax_φ E_{q_φ^(T)(x)}[log p*(x) − log q_φ^(T)(x)]
       = argmax_φ E_{q_φ^(T)(x)}[log p*(x)] + H[q_φ^(T)(x)].

Whilst the first term in this expression can be estimated directly via Monte Carlo, the entropy term, H[q_φ^(T)(x)], is intractable. To get around this, we should consider the purpose of the two terms during optimization. Maximizing the first term encourages q_φ^(T)(x) to produce samples that are in the high probability regions of the target, i.e., ensuring that q_φ^(T)(x) is high where log p*(x) is high. The entropy term acts as a regularizer preventing q_φ^(T)(x) from simply collapsing to a point mass at the mode of p(x). The key observation of our method is that HMC already fulfills this regularization role because the implicit distribution it defines is not fully flexible. If q_φ^(T)(x) were to collapse to a point mass, this would require the hyperparameters to be such that the HMC scheme always guides the chain state to the same point in space, irrespective of what initial position is sampled from q^(0)(x). This is unreasonable for practical problems. Therefore, we propose optimizing φ simply by maximizing the expected log target density under the final state of the chain:

    φ* = argmax_φ E_{q_φ^(T)(x)}[log p*(x)].    (1)

Although HMC does have a regularization effect, removing the entropy term does have some implications that we must consider. Namely, if the initial distribution, q^(0)(x), is concentrated in a very high probability region of the target, p(x), then optimizing the objective in (1) will not encourage HMC to explore the full target. Instead, it would encourage the chains to remain in this region of high probability, close to their initial sampling point, which is undesirable behaviour. The key to avoiding this problem is to choose an initial distribution that has a sufficiently wide coverage of the target. We discuss methods for doing this in Section 3.3.
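A minimal sketch of how (1) can be optimized in practice, reusing hmc_step from the snippet in Section 2.1 and treating log step sizes as the trainable tensors (our illustration; q0 and log_p_star are assumed to be defined, and all names are our own):

```python
import torch

T, n, N = 30, 2, 512                       # chain length, dimension, parallel chains
log_eps = torch.nn.Parameter(torch.full((T, n), -1.0))
opt = torch.optim.Adam([log_eps], lr=1e-2)
m = torch.ones(n)                          # masses kept fixed in this sketch

for it in range(1000):
    x = q0.sample((N,))                    # q0: a broad initial distribution
    for t in range(T):
        x = hmc_step(x, log_p_star, log_eps[t].exp(), m, L=5)
    loss = -log_p_star(x).mean()           # negative Monte Carlo estimate of (1)
    opt.zero_grad()
    loss.backward()                        # biased through the accept/reject step
    opt.step()
```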

[Figure 1: two stacked plots. Upper: E[log p(x)] against step in the HMC chain, before and after training, for both initial distributions; lower: the target pdf p(x) and the two initial distributions.]
Figure 1. (Upper) Expected log target as a function of step t in the HMC chain for initial distribution 1, N(0, 0.25), and initial distribution 2, N(0, 4), before and after training. The 'true' value, E_p[log p(x)], is also plotted. (Lower) The target pdf along with the two initial distributions.

3.1. Demonstration on a Toy Problem

We now demonstrate the ideas of the previous section on a very simple toy problem. Here the target is a 1-dimensional normal distribution N(0, 1) which we attempt to sample from using a 10 step HMC chain with each step consisting of 5 leapfrog updates. We initialize the chain either with a narrow initial distribution N(0, 0.25) or a wide initial distribution N(0, 4), as shown in the bottom plot in Figure 1. We keep m constant for all steps but train one step size ε_t for each HMC step. The top of Figure 1 plots the progression of E_{q_φ^(t)(x)}[log p(x)] during sampling for these two cases, before and after hyperparameter training. Before training, when using the narrow initial distribution, E_{q_φ^(t)(x)}[log p(x)] initially starts above the true value but converges from above as the marginal HMC distribution, q^(t), spreads to cover the target. However, after training according to (1), all the step sizes have become very small, causing the HMC chains to remain at their initial sampled positions, which is obviously detrimental for convergence. To avoid this, we can use a wide initial distribution. The top plot in Figure 1 shows that, in this case, tuning hyperparameters according to (1) greatly speeds up convergence.
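The diagnostic in the top panel of Figure 1 is simply the running estimate of E_{q^(t)}[log p(x)]; a sketch, again reusing hmc_step from the earlier snippet:

```python
def expected_log_target_curve(q0, log_p_star, eps_per_step, m, L=5, N=10000):
    # Estimate E_{q^(t)}[log p(x)] for t = 0, ..., T.
    x = q0.sample((N,))
    curve = [log_p_star(x).mean().item()]
    for eps in eps_per_step:
        # Detach between steps: only forward samples are needed here.
        x = hmc_step(x, log_p_star, eps, m, L).detach()
        curve.append(log_p_star(x).mean().item())
    return curve
```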

3.2. Estimating the Gradients

We solve (1) using stochastic gradient descent (SGD), with gradients computed using a Monte Carlo approximation and the reparameterization trick (Rezende et al., 2014; Kingma & Welling, 2014). Here, we are forced to make another approximation since the discontinuous accept/reject step introduces a non-differentiability in the Monte Carlo estimate of the objective. We can either ignore this and use biased gradients or remove the accept/reject step altogether as other works have suggested (Salimans et al., 2015; Caterini et al., 2018). We opt for the former since we wish to work with HMC in its original form with the accept/reject step.

Our empirical results confirm that we are able to successfully optimize the hyperparameters whilst using biased gradients. The intuition behind this is that the bias originates from the effect of the hyperparameters on the acceptance probability through the leapfrog discretization error. In practice, this discretization error and its gradient with respect to the hyperparameters are small due to the leapfrog integrator having second order accuracy (Leimkuhler & Reich, 2004). This results in a small overall bias. More concrete arguments justifying this can be found in the supplement. Finally, to improve computational efficiency, we avoid the calculation of second-order gradients by stopping backpropagation of the gradient through x_k in ∇_{x_k} log p*(x_k) during the leapfrog updates. We find this has little impact on convergence and can lead to 5× speedups in execution time.
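In an autodiff framework, the two options for the score differ in a single line; a sketch of the distinction (our naming, not the paper's):

```python
import torch

def score_full(x, log_p_star):
    # Keeps x in the graph (assumes x already requires grad):
    # differentiating the leapfrog update w.r.t. the hyperparameters
    # then requires second-order gradients of log p*.
    return torch.autograd.grad(log_p_star(x).sum(), x, create_graph=True)[0]

def score_stopgrad(x, log_p_star):
    # The cheaper choice used here: treat x as a constant inside the
    # score, so no second-order gradients are ever needed.
    x = x.detach().requires_grad_(True)
    return torch.autograd.grad(log_p_star(x).sum(), x)[0]
```

The grad_log_p helper in the Section 2.1 sketch already uses the detached form, corresponding to the score(x) = stop_gradient(∇_x log p*(x)) definition in Algorithm 1.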

3.3. Tuning the Initial Distribution

For our method to be useful, the initial distribution q^(0)(x) must provide a sufficient coverage of p(x). However, a q^(0)(x) that is overly spread out is undesirable because the burn-in time is then unnecessarily long. We describe below a method for choosing q^(0)(x) that aims to achieve an optimal trade-off between these two requirements.

The main idea is to use a variational approximation to the target as the initial distribution, as done by Hoffman (2017). This should be a distribution that can be easily sampled from and easily tuned to fit the target, e.g. a Gaussian or a normalizing flow (Tabak & Vanden-Eijnden, 2010; Rezende & Mohamed, 2015). Rather than use the standard ELBO for tuning q^(0)(x), we use α-divergence minimization (Hernández-Lobato et al., 2016). The α value dictates the mass covering behaviour of the resulting approximation, with α = 0 corresponding to the standard mode seeking D_KL(q^(0)(x) || p(x)) minimization and α = 1 corresponding to the mass covering D_KL(p(x) || q^(0)(x)) minimization. We compare both α values in our experiments. The α-divergence is very useful in this context as it can provide a mass covering approximation without the use of samples from the target. However, if samples from p(x) are available, then q^(0)(x) can alternatively be tuned on those samples via maximum likelihood (ML), which is also mass-covering.

The previous approaches will produce a q^(0)(x) that fits the target, but they do not guarantee that it will be broad enough. To address this, we allow our method to automatically adjust the width of the initial distribution as necessary to keep q_φ^(T)(x) as closely matched to p(x) as possible. This is achieved by applying a scalar scale factor s centered around the mean µ of q^(0)(x) to each sample x_i from this distribution, i.e. x̂_i = s(x_i − µ) + µ. We tune s by minimizing the Sliced Kernelized Stein Discrepancy (Gong et al., 2021), or SKSD¹, between the final state distribution q_φ^(T)(x) and p(x). The SKSD requires only samples from q_φ^(T) and gradients of the target, ∇_x log p*(x). This objective encourages suitable values of s because if s is too small then the tuning of φ by solving (1) (which occurs jointly with the tuning of s) will result in a degenerate q_φ^(T)(x) far from the target. The SKSD measures this discrepancy and provides a learning signal for increasing s. Conversely, if s is too large then q_φ^(T)(x) will also be far from the target, since the HMC chain will not be able to compensate for the poor initialization. The SKSD will then favor decreasing s. Finally, as in the leapfrog updates, we stop the gradient computations through x in ∇_x log p*(x) when evaluating the SKSD to avoid the calculation of second-order gradients.

Given that the SKSD is a tractable objective that measures the discrepancy between q_φ^(T)(x) and p(x), it is theoretically feasible to use the SKSD to optimize φ too. However, we found that the SKSD does not scale well when optimizing many parameters, which is why (1) is used instead. Note, however, that the SKSD works very well in practice when we only tune the single scalar parameter s.

We have empirically evaluated the method described in this section on a variety of applications and have found that it gives consistently good results.

¹ We use the metric called maxSKSD in Gong et al. (2021).
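The rescaling itself is a single line; the sketch below shows how s can enter the computation graph so that the gradient of the SKSD loss reaches it (parameterizing s through its logarithm to keep it positive is our choice, not necessarily the authors'):

```python
import torch

log_s = torch.nn.Parameter(torch.zeros(()))   # s = 1 at initialization
s_opt = torch.optim.Adam([log_s], lr=1e-2)

def rescale(x0, mu):
    # x_hat = s (x0 - mu) + mu, differentiable w.r.t. log_s
    return log_s.exp() * (x0 - mu) + mu
```

The SKSD loss is then evaluated on the final chain states produced from the rescaled samples, and one Adam step on log_s per iteration implements the s-update of Algorithm 1.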
3.4. Final optimization procedure

Algorithm 1 summarizes our optimization strategy, where Adam_update(η, ∇_η L, i) returns the new value for η given by the i-th iteration of the Adam optimizer using gradient ∇_η L, D_α(q_ψ^(0)(x) || p(x)) is an estimate of the α-divergence whose gradient is computed using doubly reparameterized gradients (Tucker et al., 2019), SKSD(x^(T)_{1:N}, score(x)) estimates the sliced kernelized Stein discrepancy, and HMC_φ(x^(0)′_n, score(x)) runs an HMC chain with initial state x^(0)′_n, target score function score(x) and hyperparameters φ. Details on the computation of D_α(q_ψ^(0)(x) || p(x)) and SKSD(x^(T)_{1:N}, score(x)) are given in the Supplementary Material.

4. Experiments

Different experiments are performed to confirm that we can discover good hyperparameter settings by using the previously described method. We provide code for reproducing all our experiments on GitHub².

² https://github.com/VincentStimper/hmc-hyperparameter-tuning

Figure 2. Histograms of 2D targets generated by […].

Table 1. KSD between the HMC samples and the target distribution for the baselines and the 4 variations of our method on each of the synthetic target distributions (equations are given in the Supplementary Material).

                           Gaussian   Laplace   Dual Moon   Mixture   Wave 1   Wave 2
    maxELT α = 0           0.0677     0.0005    0.2370      0.0004    0.0525   0.0462
    maxELT α = 1           0.0009     0.0004    0.8637      0.0010    0.0158   0.0801
    maxELT α = 0 SKSD      0.0008     0.0016    0.1684      0.0004    0.0020   0.0217
    maxELT α = 1 SKSD      0.0009     0.0014    0.2528      0.0004    0.0019   0.0317
    Hoffman (2017)         0.0364     0.0005    1.1553      0.0846    0.9447   0.0465
    Ruiz & Titsias (2019)  0.0003     0.0003    1.6290      0.0003    0.0024   0.2375
    NUTS                   0.0044     0.0016    0.2326      0.0023    0.0260   0.0965

Algorithm 1 Optimization Procedure

    Input: initial ψ₀, φ₀ and s₀, number of iterations I₁ and I₂, number of samples N
    Define score(x) = stop_gradient(∇_x log p*(x))
    # Train initial distribution
    for i = 1 to I₁ do
        ψᵢ ← Adam_update(ψᵢ₋₁, ∇_{ψᵢ₋₁} D_α(q^(0)_{ψᵢ₋₁}(x) || p(x)), i)
    end for
    µ ← mean of q^(0)_{ψ_{I₁}}(x)
    # Train HMC hyperparameters
    for i = 1 to I₂ do
        for n = 1 to N do  # this loop is vectorized in practice
            Draw x^(0)_n ∼ q^(0)_{ψ_{I₁}}(x)
            x^(0)′_n ← sᵢ₋₁(x^(0)_n − µ) + µ  # rescale samples
            x^(T)_n ← HMC_{φᵢ₋₁}(x^(0)′_n, score(x))
        end for
        φᵢ ← Adam_update(φᵢ₋₁, ∇_{φᵢ₋₁} (1/N) Σ_{n=1}^{N} log p*(x^(T)_n), i)
        sᵢ ← Adam_update(sᵢ₋₁, ∇_{sᵢ₋₁} SKSD(x^(T)_{1:N}, score(x)), i)
    end for
    Return ψ_{I₁}, φ_{I₂}, s_{I₂}

4.1. 2D distributions

We first focus on drawing approximate samples from a range of synthetic 2D target densities (Figure 2). We use 30-step HMC chains and a factorized Gaussian q^(0)(x) that is trained by minimizing the α-divergence. We optimize step sizes and masses using (1), a procedure we refer to as 'maxELT' for maximizing the expected log target. Additionally, as an ablation study, we consider different initial distribution training strategies: α = 0 or 1 and whether or not to tune the scaling s by minimizing the SKSD (s = 1 when not tuned). To quantify convergence to the target, we used the Kernelized Stein Discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016) between the generated samples and the targets.

We include three baselines for reference. The first one is taken from Hoffman (2017) and initializes the HMC chains with an α = 0 trained Gaussian distribution and tunes step sizes according to a minimum acceptance probability heuristic. The second baseline is the method from Ruiz & Titsias (2019), which initializes chains with a variational distribution trained using a novel divergence metric. The HMC parameters are not tuned, however. The final baseline is the popular No-U-Turn Sampler (Hoffman & Gelman, 2014). Full details regarding the baselines are given in the Supplementary Material.

Results are shown in Table 1. Our method with automatic s scaling using the SKSD performs consistently well across distributions. For some simple targets, e.g. Gaussian and Laplace, it performs slightly worse than Ruiz & Titsias (2019); however, this method breaks down on the more complex targets, Dual Moon and Wave 2, whereas ours remains consistent. Furthermore, we fit the targets better when we tune s by minimizing the SKSD than when we do not. For some distributions, tuning s effectively helps prevent mode seeking behaviour. We confirm this quantitatively in the Supplementary Material by comparing expected log target values. We find that narrow initial distributions (α = 0) often lead to excessively high log target values, but the tuning of s can prevent this pathology.

We also investigate the method's robustness to the initialization point. Figure 3 plots the KSD between the HMC samples and the target distribution (Wave 1) before and after training the hyperparameters for a large range of initialization step sizes. We find our method is largely invariant to the initialization point provided it is not excessively large, in which case every step in the chain is rejected and there is no gradient signal for learning. However, we do no worse than the initial distribution, as each sample remains at its initial sampling point.

[Figure 3: log-log plot of the KSD against the step size used to initialize optimization, for the initial distribution and for HMC before and after training.]
Figure 3. KSD versus step size used as initialization point for optimization. Wave 1 is the target distribution.
The step sizes than the initial distribution as each sample remains at its and masses are tuned either by a grid search to maximize initial sampling point. the log marginal likelihood on a validation set or the step sizes are tuned such that the average acceptance probability 4.2. Sparse Signal Recovery is 0.65. The fnal baseline is sequential—the No-U-Turn sampler. To generate samples, we frst sample the initial dis- We now consider higher dimensional target distributions. tribution once then run 15000 accept/reject steps, discarding Specifcally, posterior distributions in Bayesian compressed the frst 5000 samples. The hyperparameters are tuned using sensing (Donoho, 2006; Ji et al., 2008). The aim is to dual averaging during the burn-in period. Full experimental d recover the sparse signal w ∈ R from measurements y ∈ details are in the Supplementary Material. Rn where n < d. The observation model is y = Xw + n 2 The results are shown in Figure 4 and we see that our method e where e ∈ R , e ∼ N (0, σ0I), is additive Gaussian noise and X ∈ Rn×d is a measurement matrix obtained by generally outperforms the baselines. It is also noticeable sampling its entries from a standard Gaussian distribution that the sequential sampler (NUTS) has a very high and then normalizing the rows to have unit Euclidean norm. between repeats. This is because it is very sensitive to the We place a sparsity enforcing horseshoe prior on w and initialization point. NUTS performs well in the very few 2 cases where the initialization is close to a posterior mode consider the posterior for w given y, X and σ0. When n << d this posterior is multi-modal representing multiple with good properties. However, if it is initialized close a potential explanations for the available observations. In sub-optimal mode, it will likely stay there for the whole run these experiments we fx n = 6 and d = 64. and fnd hyperparameters only suited to this local region giving a very poor log likelihood. The parallel samplers do We draw posterior samples using a 20 step HMC chain with not have this issue as they have a new initialization point for 5 leapfrog iterations per step, initialized with a factorized each sample. We also note that, in this case, using SKSD to Gaussian distribution that is trained with α = 1. We do not tune s results in a slight degradation in performance. On this use α = 0 here since this method performs signifcantly problem, α = 1 is already well suited to fnd a good initial worse in this problem due to its mode seeking tendency distribution and, indeed, the value for s tuned by SKSD was (Hernandez-Lobato´ et al., 2015). We evaluate the quality very close to 1. of our samples by calculating their log marginal likelihood on test data generated using the same sparse signal as in 4.3. Deep Latent Gaussian Models the training set. The highest test log marginal likelihood is obtained when the generated samples come from the true We now focus on training Variational Autoencoders (VAEs) posterior distribution. (Kingma & Welling, 2014; Rezende et al., 2014). These are generative models of the form p(x, z) = pθ(x|z)N (z; 0, I),

4.3. Deep Latent Gaussian Models

We now focus on training Variational Autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014). These are generative models of the form p(x, z) = pθ(x|z)N(z; 0, I), explaining observed data x with a latent variable z and a likelihood pθ(x|z) parameterized by a neural network NNθ(z) with parameters θ and input z. The standard training in VAEs is to use a factorized Gaussian distribution qψ(z|x) parameterized by NNψ(x) to approximate pθ(z|x) and train (θ, ψ) jointly by maximizing the ELBO.
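A minimal sketch of such a model; the fully connected network is a stand-in for illustration, not the convolutional architecture used in the experiments:

```python
import math
import torch
import torch.nn as nn

class DLGM(nn.Module):
    def __init__(self, latent_dim=32, data_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, data_dim))

    def log_joint(self, x, z):
        # log p_theta(x, z) = log p_theta(x | z) + log N(z; 0, I)
        logits = self.net(z)
        log_lik = -nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction='none').sum(-1)
        log_prior = -0.5 * (z * z).sum(-1) \
            - 0.5 * z.shape[-1] * math.log(2 * math.pi)
        return log_lik + log_prior
```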

Our approximation to pθ(z|x) is the HMC chain output distribution q_φ^(T)(z|x). The initial state of the HMC chain is sampled from q_ψ^(0)(z|x), which is given by a factorized Gaussian distribution parameterized by NNψ(x) and appropriately scaled by s as before. The parameters ψ are trained by minimizing the α-divergence and s is trained using the SKSD. We train φ and θ by jointly maximizing the objective in (1) with respect to these two parameters. We consider HMC chains of length 30 with 5 leapfrog iterations per step and tune a different step size parameter per dimension and per step in the HMC chain, while the mass parameters are all kept constant and equal to 1 throughout the HMC chain. Note that we use only one set of HMC hyperparameters for all pθ(z|x) targets, independently of x. One could make the hyperparameters depend on x through an amortization network, but we found that this did not improve performance.

We consider two benchmark datasets: MNIST and Fashion MNIST. As is common, we use binarized images and a Bernoulli likelihood pθ(x|z) parameterized with the same convolutional architecture as Salimans et al. (2015). Full experimental details are in the Supplementary Material.
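Putting the pieces together, a sketch of the resulting per-minibatch loss, reusing hmc_step from the Section 2.1 snippet and log_joint from the model sketch above (mu and sigma come from the encoder NN_ψ(x); s, eps and m are the scale factor and HMC hyperparameters; all names are ours):

```python
import torch

def vae_hmc_loss(x, model, mu, sigma, s, eps, m, T=30, L=5):
    # Run T HMC steps on z targeting log p_theta(x, z), then return the
    # negative of objective (1); minimizing it trains theta and eps jointly.
    def log_joint_x(z):
        return model.log_joint(x, z)

    # Scaled reparameterized draw: with z0 = mu + sigma * noise, the
    # rescaling z_hat = s (z0 - mu) + mu simplifies to the line below.
    z = mu + s * sigma * torch.randn_like(mu)
    for t in range(T):
        z = hmc_step(z, log_joint_x, eps[t], m, L)
    return -log_joint_x(z).mean()
```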

We evaluate the quality of our trained models using the marginal log-likelihood log pθ(x) on the test set, estimated using Hamiltonian Annealed Importance Sampling, or HAIS (Sohl-Dickstein & Culpepper, 2012). Results are shown in Table 2. We also report log-likelihood values for multiple baselines. Using the same neural architecture, we implemented the standard VAE and IWAE³ models. We also implemented another method for tuning φ (Hoffman, 2017) where the step sizes are adjusted to make the minimum average acceptance probability in each minibatch equal to 0.25. Furthermore, we implemented the method for tuning ψ from Ruiz & Titsias (2019) while the HMC step sizes were tuned to ensure the mean average acceptance probability in each minibatch is equal to 0.65. We update θ as in our method and use the same number of leapfrog steps for a fair comparison. Finally, we report results for MNIST from Salimans et al. (2015) and Caterini et al. (2018), which include HMC hyperparameter tuning during training and use the same network architecture as us. These authors only evaluated on MNIST, so we reimplemented their methods for Fashion MNIST⁴. We confirm significant differences between the models using paired t-tests. Full results are in the Supplementary Material.

³ We used the DReG estimator from (Tucker et al., 2019) for training the IWAE.
⁴ We reimplemented Salimans et al. (2015) ourselves and used the implementation from https://github.com/anthonycaterini/hvae-nips for Caterini et al. (2018).

Table 2. Average test marginal log-likelihood and its standard error (SE) estimated using HAIS for different methods for MNIST and Fashion MNIST. For methods with a scale factor s, we report the final scale after training. We also report the best log-likelihood values from previous works.

                                MNIST                     Fashion MNIST
    Model                 Scale    Mean     SE       Scale    Mean      SE
    VAE                   -        -85.08   0.22     -        -108.54   0.60
    DReG-IWAE             -        -83.73   0.21     -        -104.48   0.58
    maxELT α = 0          1.0      -83.48   0.21     1.0      -104.08   0.58
    maxELT α = 1          1.0      -82.46   0.21     1.0      -103.57   0.58
    maxELT α = 0 SKSD     6.79     -81.91   0.20     5.58     -103.18   0.58
    maxELT α = 1 SKSD     3.90     -81.94   0.20     3.59     -102.29   0.57
    Hoffman               -        -81.74   0.20     -        -103.04   0.58
    Ruiz & Titsias        -        -82.45   0.21     -        -105.13   0.59
    Salimans et al.       -        -81.94   -        -        -104.44   0.59
    Caterini et al.       -        -82.62   -        -        -104.26   0.58

On both datasets, the HMC based methods achieve better performance than VAE or IWAE, showing that reducing the approximation bias of the variational distribution with HMC greatly helps. Furthermore, we see that adding the scale factor to our method significantly improves performance as this avoids degenerate behaviour when training the HMC hyperparameters. We note that, without any scaling s, α = 1 outperforms α = 0 due to α = 0 resulting in a too narrow initial distribution. With scaling, the SKSD automatically widens the initial distribution, making the performance of both methods similar. Finally, we observe that HMC based methods top out at similar log-likelihood values (within around one standard error). We believe this is due to the methods reaching the limits of the chosen neural network architecture on these datasets, with no more gains to be made from more accurate posterior approximations.

4.4. Molecular Configurations

Finally, we evaluate our method on the complex real-world problem of sampling equilibrium molecular configurations from the Boltzmann distribution of the molecule Alanine Dipeptide. The unnormalized target distribution for the atom coordinates x is e^{−u(x)}, where u denotes the potential energy of the system, which can be obtained using the laws of physics. This problem is usually tackled via Molecular Dynamics (MD) simulations. Here, we aim to produce independent samples from e^{−u(x)} using our trained short HMC chains. We do not operate directly on the Cartesian coordinates but apply the coordinate transform presented by Noé et al. (2019) to map some of the Cartesian coordinates to bond lengths, bond angles, and dihedral angles. The final dimensionality of x is 60. For more details, we refer to the Supplementary Material.

For the initial distribution, we use a normalizing flow based on real-valued non-volume preserving (RNVP) transformations (Dinh et al., 2017), followed by 50 HMC steps with 10 leapfrog iterations per step. We used different methods to train the flow and the HMC hyperparameters. The flow was trained with the α = {0, 1}-divergences and by ML. The latter was done using 10⁵ training data samples obtained via an MD simulation⁵. As in the case with 2D densities and with sparse signals, but unlike in the VAE case, we first trained the flows and then kept them fixed when tuning the HMC hyperparameters and the scale factor.

For each initial distribution type, the HMC hyperparameters were tuned according to maxELT alone or by maxELT with SKSD scale training. As a baseline, we optimized the HMC parameters via grid search, keeping step sizes and masses constant across dimensions and HMC steps, varying these two parameters in a grid and picking the combination that gave the lowest median marginal KL-divergence to the MD training data. We also considered another baseline by adjusting the step size value such that the average acceptance probability was 0.65 (referred to as p̄_a = 0.65). Further implementation details are given in the Supplementary Material.

⁵ The same dataset was used to obtain the mean and standard deviation required in the normalization step of the coordinate transform.

A new MD simulation was run to obtain 10⁶ ground truth samples for evaluating the performance of the different methods. For tractability, this is done by comparing the marginal distributions of the generated and ground truth samples. First, we use Ramachandran plots (Ramachandran et al., 1963), which are 2D histograms for the two dihedral angles in the bonds connecting an amino acid to the protein backbone. These plots are frequently used to analyse how proteins fold locally. In the Alanine Dipeptide case we can obtain a Ramachandran plot for the bond connecting the two amino acids forming this molecule. Two sample plots are shown in Figure 5.

[Figure 5: two Ramachandran plots, (a) ground truth (MD) and (b) maxELT & SKSD.]
Figure 5. Ramachandran plot of (a) the ground truth determined via a MD simulation and (b) samples from the model with the proposal trained with ML and HMC hyperparameters being tuned by maxELT & SKSD.

We compute KL divergences between the 2D histograms (Ramachandran plots) for the ground truth samples and for the samples generated by the different methods. The results are given in Figure 6. All the corresponding Ramachandran plots are shown in the Supplementary Material. Our method outperforms the baselines for all proposals except for the one trained with the α = 0-divergence, where it improves upon grid search but performs worse than the p̄_a = 0.65 baseline.

[Figure 6: KL divergences for proposals trained with α = 0, α = 1 and ML, shown for the initial distribution, maxELT (ours), maxELT & SKSD (ours), grid search and p̄_a = 0.65.]
Figure 6. Visualization of the KL-divergences of the proposal and the models where different HMC hyperparameter tuning schemes were used.
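The 2D-histogram KL divergence used here can be computed directly; a sketch (bin count and smoothing constant are our choices):

```python
import numpy as np

def ramachandran_kl(angles_p, angles_q, bins=64, eps=1e-10):
    # KL divergence between 2D histograms of two dihedral angles.
    # angles_*: arrays of shape (N, 2) with values in [-pi, pi].
    rng = [[-np.pi, np.pi], [-np.pi, np.pi]]
    hp, _, _ = np.histogram2d(angles_p[:, 0], angles_p[:, 1], bins=bins, range=rng)
    hq, _, _ = np.histogram2d(angles_q[:, 0], angles_q[:, 1], bins=bins, range=rng)
    hp = (hp + eps) / (hp + eps).sum()   # smooth empty bins, then normalize
    hq = (hq + eps) / (hq + eps).sum()
    return float(np.sum(hp * np.log(hp / hq)))
```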
We lems where choosing reasonable hyperparameters is highly perform a Wilcoxon test to check whether the resulting non-trivial and also for problems where a large number of 60 divergence values (one per dimension) are consistently samples need to be produced for a downstream task. lower for one method or another. Table 3 shows p-values for the case in which maxELT & SKSD is being tested for Previous works have also used VI to obtain an objective for having lower KL-divergences than the baselines. In all but the gradient based tuning of HMC hyperparameters. As dis- cussed previously, Salimans et al. (2015); Wolf et al. (2016) A Gradient Based Strategy for Hamiltonian Monte Carlo Hyperparameter Optimization consider the entire joint distribution of the HMC chain as HMC. Though, they must still use gradient-free optimiza- the variational distribution. The higher dimensionality in- tion on a heuristic to tune the HMC parameters, one could creases the looseness of the variational bound, making the investigate performance if hyperparameters were instead method highly sensitive to the length of the chain and the tuned using our differentiable objective. accuracy of a reverse dynamics approximation. Our exper- Our work also builds upon methods from statistical mechan- iments in section 4.3 show we can consider much longer ics. The Boltzmann Generator (Noe´ et al., 2019) opened up chains and perform just as well without needing a reverse this line of research by using a normalizing fow to sample approximation. Caterini et al. (2018) also construct an al- molecular confgurations. We found we can improve upon ternative ELBO for HMC. However, they only sample the this by using the fow as the initial distribution for HMC, auxiliary momentum variables once at the start of the chain fne tuning the fow samples with our short chains. Our which reduces the empirical performance of HMC. Alanine Dipeptide experiment comes from the recent work In contrast to these methods, some gradient based tuning of Wu et al. (2020) on stochastic normalizing fows, which techniques do not use ideas from VI but instead optimize a consist of stochastic steps interspersed between determin- proxy for mixing speed. Levy et al. (2018) generalize the istic steps in a normalizing fow. It is possible to combine standard leapfrog integrator used in HMC with multi-layer such models with our approach, using our objective to tune perceptrons which are then trained by maximizing a modi- hyperparameters within the stochastic layers. We leave this fed version of the expected squared jumped distance. We extension to future work. improve upon this objective by directly optimizing conver- gence speed by using gradient information from the target 6. Conclusion distribution itself. It would be an interesting direction to use our objective to train this generalised leapfrog oper- In this work, we presented a new objective motivated by ator. Finally, Titsias & Dellaportas (2019) consider the VI that can be easily used for the gradient-based optimiza- gradient based tuning of the Markov transition operator, tion of HMC hyperparameters. We provided a fully auto- pφ(xt|xt−1), in the case of a single long MCMC chain. matic method for choosing an initial distribution for the They optimize with respect to the expected acceptance prob- HMC chain that reduces burn-in time and aids optimization. ability for the next step in the chain, regularized with the Evaluating on multiple real-world problems, we found our entropy of pφ(xt|xt−1). 
Unfortunately, as this entropy is in- method is competitive with or improves upon existing meth- tractable when using the leapfrog algorithm as the transition ods for tuning hyperparameters. We hope this encourages operator, it cannot be applied to HMC in its current form. further work applying this idea to other methods that use HMC and that would beneft from increased convergence There are also many non-gradient based heuristics for tuning speed. HMC hyperparameters. The popular No-U-Turn Sampler (Hoffman & Gelman, 2014) can adaptively set the number of leapfrog steps L to avoid U-turns and fnd a global constant Acknowledgements for the step sizes by adjusting the average acceptance rate. In Andrew Campbell acknowledges support from the EPSRC section 4.1, we found we can outperform NUTS even though CDT in Modern Statistics and Statistical Machine Learning we do not adaptively set L. With our objective, we can tune (EP/S023151/1). Jose´ Miguel Hernandez´ Lobato acknowl- individual step sizes (and masses) for each dimension and edges support from a Turing AI Fellowship under grant step in the chain, allowing for a much higher degree of EP/V023756/1. Yichuan Zhang acknowledges support from granular control over the algorithm. Furthermore, we do not Samsung Research, Samsung Electronics Co., Seoul, Re- need to rely on ‘rules of thumb’ such as standard acceptance public of Korea and from Boltzbit Ltd. rate targets used in many (Hoffman & Gelman, 2014; Hoffman, 2017) but we can automatically tune all continuous hyperparameters using information from the References target distribution directly. Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Other works use MCMC as part of a hybrid inference Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., scheme. Ruiz & Titsias (2019) propose a novel objective to Li, P., and Riddell, A. : A probabilistic programming improve the training of the initial distribution using MCMC language. Journal of statistical software, 76(1), 2017. samples. However, their objective cannot be used for tuning MCMC as it encourages fnal samples to be close to the Caterini, A. L., Doucet, A., and Sejdinovic, D. Hamiltonian initial distribution. Hoffman et al. (2019) use normalizing variational auto-encoder. Advances in Neural Information fows to warp the target distribution such that it is close to Processing Systems (NeurIPS), 2018. an isotropic Gaussian and thus easy to sample from using Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel A Gradient Based Strategy for Hamiltonian Monte Carlo Hyperparameter Optimization

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Caterini, A. L., Doucet, A., and Sejdinovic, D. Hamiltonian variational auto-encoder. Advances in Neural Information Processing Systems (NeurIPS), 2018.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. Proceedings of the International Conference on Machine Learning (ICML), 2016.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. International Conference on Learning Representations (ICLR), 2017.

Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006. doi: 10.1109/TIT.2006.871582.

Gong, W., Li, Y., and Hernández-Lobato, J. M. Sliced kernelized Stein discrepancy. International Conference on Learning Representations (ICLR), 2021.

Hairer, E., Lubich, C., and Wanner, G. Geometric numerical integration illustrated by the Störmer–Verlet method. Acta Numerica, 12:399–450, 2003.

Hernández-Lobato, J., Hernández-Lobato, D., and Suárez, A. Expectation propagation in linear regression models with spike-and-slab priors. Machine Learning, 99:437–487, 2015.

Hernández-Lobato, J. M., Li, Y., Rowland, M., Bui, T., Hernández-Lobato, D., and Turner, R. Black-box alpha divergence minimization. Proceedings of the International Conference on Machine Learning (ICML), 2016.

Hoffman, M. and Gelman, A. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research (JMLR), 2014.

Hoffman, M. and Ma, Y. Black-box variational inference as a parametric approximation to Langevin dynamics. Proceedings of the International Conference on Machine Learning (ICML), 2020.

Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.

Hoffman, M. D. Learning deep latent Gaussian models with Markov Chain Monte Carlo. Proceedings of the International Conference on Machine Learning (ICML), 2017.

Ji, S., Xue, Y., and Carin, L. Bayesian compressive sensing. IEEE Transactions on Signal Processing, 56(6):2346–2356, 2008.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Kingma, D. P. and Welling, M. Auto-encoding Variational Bayes. International Conference on Learning Representations (ICLR), 2014.

Leimkuhler, B. and Reich, S. Simulating Hamiltonian Dynamics. Number 14. Cambridge University Press, 2004.

Levy, D., Hoffman, M. D., and Sohl-Dickstein, J. Generalizing Hamiltonian Monte Carlo with neural networks. International Conference on Learning Representations (ICLR), 2018.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. Proceedings of the International Conference on Machine Learning (ICML), 2016.

Neal, R. M. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical report, University of Toronto, Department of Computer Science, 1993.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Noé, F., Olsson, S., Köhler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457), September 2019. doi: 10.1126/science.aaw1147.

Pasarica, C. and Gelman, A. Adaptively scaling the Metropolis algorithm using expected squared jumped distance. Statistica Sinica, 20(1):343–364, 2010.

Ramachandran, G. N., Ramakrishnan, C., and Sasisekharan, V. Stereochemistry of polypeptide chain configurations. Journal of Molecular Biology, 7(1):95–99, July 1963. doi: 10.1016/S0022-2836(63)80023-6.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. Proceedings of the International Conference on Machine Learning (ICML), 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the International Conference on Machine Learning (ICML), 2014.

Ruiz, F. and Titsias, M. A contrastive divergence for combining variational inference and MCMC. Proceedings of the International Conference on Machine Learning (ICML), 2019.

Salimans, T., Kingma, D., and Welling, M. Markov Chain Monte Carlo and variational inference: Bridging the gap. Proceedings of the International Conference on Machine Learning (ICML), 2015.

Sohl-Dickstein, J. and Culpepper, B. J. Hamiltonian annealed importance sampling for partition function estimation. Technical report, Redwood Center for Theoretical Neuroscience, University of California, Berkeley, 2012.

Tabak, E. G. and Vanden-Eijnden, E. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010. doi: 10.4310/CMS.2010.v8.n1.a11.

Titsias, M. and Dellaportas, P. Gradient-based adaptive Markov Chain Monte Carlo. Advances in Neural Information Processing Systems (NeurIPS), 2019.

Tucker, G., Lawson, D., Gu, S., and Maddison, C. Doubly reparameterized gradient estimators for Monte Carlo objectives. International Conference on Learning Representations (ICLR), 2019.

Wolf, C., Karl, M., and van der Smagt, P. Variational inference with Hamiltonian Monte Carlo. arXiv preprint arXiv:1609.08203, 2016.

Wu, H., Köhler, J., and Noé, F. Stochastic normalizing flows. Advances in Neural Information Processing Systems (NeurIPS), 2020.