
A Framework for Adaptive MCMC Targeting Multimodal Distributions

Emilia Pompe, Department of Statistics, University of Oxford
Palermo, European Meeting of Statisticians 2019
Joint work with Chris Holmes (Oxford) and Krys Łatuszyński (Warwick)

Multimodality – why is this still an open problem?

How to sample efficiently from a multimodal distribution?

• Identify the regions of high probability.
• Move between these regions.
• Move efficiently within each mode.
• Adapt the parameters while the algorithm runs.
• Use parallel computations.

Tempering-based methods

e.g. simulated tempering [Marinari and Parisi (1992)], parallel tempering [Geyer (1991)]

Problems:
• Proven to mix poorly when π has both narrow and wide modes [Woodard et al. (2009)].
• No mechanism for remembering where the modes are.
• The same covariance matrix is used everywhere.

Outline

1. Multimodality: why still an open issue?
2. Jumping Adaptive Multimodal Sampler (JAMS)
3. Illustrations
4. Ergodicity of JAMS

JAMS: 3 key ideas

1. Find the modes and then sample!
2. Propose within-mode and between-mode moves separately.
3. Use different covariance matrices in different regions, and learn them while the algorithm runs.

Mode-finding

1. Sample n points (from the prior).
2. Run n minimisation procedures for −log(π) in parallel (BFGS, a quasi-Newton method).
3. Merge vectors that correspond to the same local maxima of π (using Hessians).
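The mode-finding step translates almost directly into code. The snippet below is a minimal sketch, not the authors' implementation: it assumes a user-supplied log-density `log_pi` and a prior sampler `prior_sample`, uses SciPy's BFGS, and merges optimisation runs by a simple distance threshold rather than the Hessian-based rule above; the optimisation runs could equally be dispatched in parallel.

```python
# A minimal sketch of the mode-finding step, assuming a generic log-density
# `log_pi`; SciPy's BFGS and a distance-based merging rule are illustrative
# choices, not the exact procedure from the paper.
import numpy as np
from scipy.optimize import minimize

def find_modes(log_pi, prior_sample, n_starts=100, merge_tol=1e-2):
    """Run n_starts BFGS minimisations of -log(pi) and merge near-duplicates."""
    starts = [prior_sample() for _ in range(n_starts)]          # 1. sample starting points
    results = [minimize(lambda x: -log_pi(x), x0, method="BFGS")
               for x0 in starts]                                 # 2. minimise -log(pi); can be run in parallel

    modes, hessian_invs = [], []
    for res in results:
        if not res.success:
            continue
        # 3. keep one representative per local maximum of pi
        if all(np.linalg.norm(res.x - m) > merge_tol for m in modes):
            modes.append(res.x)
            hessian_invs.append(res.hess_inv)   # BFGS inverse-Hessian estimate, reusable as an initial Sigma_i
    return modes, hessian_invs
```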

Target distribution π̃

Set of modes: I = {μ₁, . . . , μ_N}. Target distribution on an augmented state space 𝒳 × I:

$$\tilde{\pi}(x, i) = \pi(x)\,\frac{w_i\, Q_i(\mu_i, \Sigma_i)(x)}{\sum_{j \in I} w_j\, Q_j(\mu_j, \Sigma_j)(x)},$$

for $\sum_{i=1}^{N} w_i = 1$ and $Q_i(\mu_i, \Sigma_i)$ an elliptical distribution, e.g. Gaussian or multivariate t. Then π(x) is the marginal distribution of π̃ with respect to x:

$$\sum_{i=1}^{N} \pi(x)\,\frac{w_i\, Q_i(\mu_i, \Sigma_i)(x)}{\sum_{j \in I} w_j\, Q_j(\mu_j, \Sigma_j)(x)} = \pi(x).$$
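As a concrete illustration, the augmented target can be evaluated pointwise once the modes, covariances and weights are available. Below is a minimal sketch assuming Gaussian components Q_i; the function and argument names are placeholders, not from the paper.

```python
# Minimal sketch of evaluating log pi~(x, i) for Gaussian Q_i; `log_pi`,
# `modes`, `covs` and `weights` stand for quantities produced by the
# mode-finding and adaptation steps.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_pi_tilde(x, i, log_pi, modes, covs, weights):
    log_q = np.array([multivariate_normal.logpdf(x, mean=m, cov=c)
                      for m, c in zip(modes, covs)])       # log Q_j(mu_j, Sigma_j)(x)
    log_num = np.log(weights[i]) + log_q[i]                # log( w_i Q_i(x) )
    log_den = logsumexp(np.log(weights) + log_q)           # log( sum_j w_j Q_j(x) )
    return log_pi(x) + log_num - log_den
```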

Local and global moves

Suppose the current state of the chain is (x, i).

• Local move (with probability 1 − ε):
  • preserves the mode ((x, i) → (y, i));
  • a Random Walk Metropolis step with proposal distribution $\mathcal{N}\!\left(x, \tfrac{2.38^2}{d}\Sigma_i\right)$ (or the t distribution).
• Jump move (with probability ε):
  • a new mode k is proposed with probability $a_{ik}$ ((x, i) → (y, k));
  • two ways of proposing y:
    • deterministic jump: y such that $(y - \mu_k)^{T}\Sigma_k^{-1}(y - \mu_k) = (x - \mu_i)^{T}\Sigma_i^{-1}(x - \mu_i)$;
    • independence sampler: $y \sim \mathcal{N}(\mu_k, \Sigma_k)$.

The acceptance probabilities are chosen so that detailed balance holds for π̃.
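To make the deterministic jump concrete: one construction that preserves the Mahalanobis distance in the displayed equation is an affine map through Cholesky factors. The sketch below is an illustrative implementation of that idea, not necessarily the exact map used in JAMS.

```python
# One way to realise the deterministic jump (x, i) -> (y, k): an affine map
# through Cholesky factors, which preserves the Mahalanobis distance
# (y - mu_k)^T Sigma_k^{-1} (y - mu_k) = (x - mu_i)^T Sigma_i^{-1} (x - mu_i).
import numpy as np

def deterministic_jump(x, mu_i, sigma_i, mu_k, sigma_k):
    L_i = np.linalg.cholesky(sigma_i)      # Sigma_i = L_i L_i^T
    L_k = np.linalg.cholesky(sigma_k)      # Sigma_k = L_k L_k^T
    z = np.linalg.solve(L_i, x - mu_i)     # whiten x with respect to mode i
    return mu_k + L_k @ z                  # re-colour with respect to mode k
```

For such a deterministic proposal, the Metropolis–Hastings acceptance ratio would also involve the Jacobian of the map, det(L_k)/det(L_i) = (|Σ_k|/|Σ_i|)^{1/2}, together with the mode-proposal probabilities a_{ik} and a_{ki}, so that detailed balance with respect to π̃ is preserved.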

Updating the parameters

• Update the weights w_i so that they represent the observed proportions of samples in each mode.

• When a sample (x, i) appears, recalculate Σ_i.
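As an illustration of what this adaptation can look like in practice, here is a hedged sketch using running counts for the weights and a recursive empirical-covariance update per mode. It mirrors common adaptive-MCMC practice and is not claimed to be the exact JAMS update rule; the class and function names are invented for the example.

```python
# Generic sketch of per-mode parameter adaptation: running counts give the
# weights w_i, and a recursive empirical-covariance update refreshes Sigma_i
# from the samples assigned to mode i (x is assumed to be a NumPy array).
import numpy as np

class ModeParams:
    def __init__(self, mu0, d, eps=1e-6):
        self.count = 0
        self.mean = np.array(mu0, dtype=float)
        self.cov = np.eye(d)
        self.eps = eps                       # small jitter keeps cov positive definite

    def update(self, x):
        self.count += 1
        n = self.count
        delta = x - self.mean
        self.mean += delta / n
        # recursive update of the empirical covariance of samples in this mode
        self.cov = ((n - 1) / n) * self.cov + np.outer(delta, x - self.mean) / n
        self.cov += self.eps * np.eye(len(x)) / n

def weights_from_counts(params):
    counts = np.array([p.count for p in params], dtype=float)
    total = counts.sum()
    # w_i = observed proportion of samples assigned to mode i (uniform if no samples yet)
    return counts / total if total > 0 else np.full(len(counts), 1.0 / len(counts))
```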

Are the samples (x, i) collected from the region around μ_i?

Controlling the modes

Suppose we propose a local move (x, i) → (y, i).

Acceptance probability:

$$\min\left\{1, \frac{\tilde{\pi}(y, i)}{\tilde{\pi}(x, i)}\right\} = \min\left\{1,\; \frac{\pi(y)}{\pi(x)} \cdot \frac{Q_i(\mu_i, \Sigma_i)(y)}{Q_i(\mu_i, \Sigma_i)(x)} \cdot \frac{\sum_{j \in I} w_j Q_j(\mu_j, \Sigma_j)(x)}{\sum_{j \in I} w_j Q_j(\mu_j, \Sigma_j)(y)}\right\}.$$

If y lies far from μ_i, the factor Q_i(μ_i, Σ_i)(y) is tiny, so the ratio is tiny: (y, i) is unlikely to be accepted.

Illustrations

• Example studied by Woodard et al. (2009):
$$\pi(x) = \frac{1}{2}\,\mathcal{N}\!\Big(\underbrace{-(1, \ldots, 1)}_{d},\; \sigma_1^2 I_d\Big) + \frac{1}{2}\,\mathcal{N}\!\Big(\underbrace{(1, \ldots, 1)}_{d},\; \sigma_2^2 I_d\Big),$$
with $\sigma_1^2 = 0.5\sqrt{d/100}$ and $\sigma_2^2 = \sqrt{d/100}$ (up to dimension 200); a small density sketch follows after this list.

• Mixture of t-distributions and banana-shaped distributions (up to dimension 80).
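For the first example above, the target is easy to write down explicitly. Below is a minimal sketch of its log-density, with the variance scaling read off the slide, intended only as a test target for a sampler.

```python
# Log-density of the two-component Gaussian mixture from the Woodard et al.
# example above; sigma1^2 = 0.5*sqrt(d/100) and sigma2^2 = sqrt(d/100).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def make_log_pi(d):
    mu = np.ones(d)
    var1, var2 = 0.5 * np.sqrt(d / 100.0), np.sqrt(d / 100.0)
    comp1 = multivariate_normal(mean=-mu, cov=var1 * np.eye(d))
    comp2 = multivariate_normal(mean=mu, cov=var2 * np.eye(d))

    def log_pi(x):
        # equal-weight mixture: log(0.5 exp(l1) + 0.5 exp(l2))
        return logsumexp([comp1.logpdf(x), comp2.logpdf(x)]) + np.log(0.5)
    return log_pi
```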

Sensor network localisation (1)

11 sensors on [0, 1]², with the locations of 8 of them unknown.

• $x_i$ is observed from $x_j$ with probability $\exp\!\left(-\frac{\|x_i - x_j\|^2}{2 \times 0.3^2}\right)$.

• The observed distance $y_{ij}$ between two sensors follows
$$y_{ij} \mid w_{ij} = 1 \;\sim\; \mathcal{N}\big(\|x_i - x_j\|,\, 0.02^2\big).$$

• Uniform prior on x1,..., x8.
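The resulting posterior over the eight unknown locations is the multimodal target being sampled. Below is a hedged sketch of its log-density under the model just described; the matrices `w` (observation indicators) and `y` (observed distances), and the helper name `log_posterior`, are assumptions for illustration, and bookkeeping details may differ from the paper.

```python
# Hedged sketch of the sensor-localisation log-posterior: `w` (0/1 observation
# indicators) and `y` (observed distances) are assumed given as (11, 11)
# matrices, with the known sensor locations stacked after the unknown ones.
import numpy as np

def log_posterior(theta, w, y, known, n_unknown=8, R=0.3, tau=0.02):
    locs = np.vstack([theta.reshape(n_unknown, 2), known])   # unknown + known sensor positions
    if np.any(locs[:n_unknown] < 0) or np.any(locs[:n_unknown] > 1):
        return -np.inf                                        # uniform prior on [0, 1]^2
    logp = 0.0
    n = locs.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(locs[i] - locs[j])
            p_obs = np.exp(-dist**2 / (2 * R**2))             # detection probability
            if w[i, j] == 1:
                logp += np.log(p_obs) - 0.5 * ((y[i, j] - dist) / tau) ** 2
            else:
                logp += np.log1p(-p_obs)                      # pair not observed
    return logp
```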

JAMS versus Adaptive Parallel Tempering (APT) [Miasojedow et al. (2013)]:
• JAMS: 500,000 iterations, 10,000 mode-finding runs.
• APT: 500,000 iterations × 4 temperatures.

Sensor network localisation (2)

More simulations in arXiv:1812.02609.

Ergodicity of the algorithm

$$\tilde{\pi}(x, i) = \pi(x)\,\frac{w_i\, Q_i(\mu_i, \Sigma_i)(x)}{\sum_{j \in I} w_j\, Q_j(\mu_j, \Sigma_j)(x)},$$

When the weights and covariance matrices are adapted on the fly (indexed by γ), the target itself changes:

$$\tilde{\pi}_{\gamma}(x, i) = \pi(x)\,\frac{w_{\gamma,i}\, Q_i(\mu_i, \Sigma_{\gamma,i})(x)}{\sum_{j \in I} w_{\gamma,j}\, Q_j(\mu_j, \Sigma_{\gamma,j})(x)}.$$

Since the target π̃_γ depends on the adapted parameters, this is NOT a standard adaptive MCMC algorithm.

Auxiliary Variable Adaptive MCMC class

Augmented state space $\tilde{\mathcal{X}} = \mathcal{X} \times \Phi$.

1. $\lim_{n \to \infty} \|\tilde{P}_{\gamma}^{n}(\tilde{x}, \cdot) - \tilde{\pi}_{\gamma}(\cdot)\|_{TV} = 0$ for all $\tilde{x} := (x, \phi) \in \tilde{\mathcal{X}}$.

2. Distributions $\tilde{\pi}_{\gamma}$ are such that $\tilde{\pi}_{\gamma}(B \times \Phi) = \pi(B)$ for every $B \in \mathcal{B}(\mathcal{X})$ and $\gamma \in \mathcal{Y}$.

Convergence

$$\sup_{B \in \mathcal{B}(\mathcal{X})} \left| P\!\left(\tilde{X}_n \in (B \times \Phi) \mid \tilde{X}_0 = \tilde{x}, \Gamma_0 = \gamma\right) - \pi(B) \right| \to 0 \quad \text{as } n \to \infty.$$

Ergodicity of Adaptive MCMC

(a) (Diminishing adaptation.) The random variable
$$D_n := \sup_{x \in \mathcal{X}} \|P_{\Gamma_{n+1}}(x, \cdot) - P_{\Gamma_n}(x, \cdot)\|_{TV}$$
converges to 0 in probability.

(b) (Containment.) For all ε > 0 and all δ > 0, there exists N = N(ε, δ) such that
$$P\!\left(M_{\varepsilon}(X_n, \Gamma_n) > N \mid X_0 = x, \Gamma_0 = \gamma\right) \le \delta$$
for all $n \in \mathbb{N}$, where
$$M_{\varepsilon}(x, \gamma) = \inf\{k \ge 1 : \|P_{\gamma}^{k}(x, \cdot) - \pi(\cdot)\|_{TV} \le \varepsilon\}.$$

Ergodicity of Adaptive MCMC [Roberts and Rosenthal (2007)]
Under the containment and diminishing adaptation conditions, adaptive MCMC algorithms are ergodic.

Ergodicity of Auxiliary Variable Adaptive MCMC [P. et al. (2018)]
Under the containment condition
$$P\!\left(M_{\varepsilon}(\tilde{X}_n, \Gamma_n) > N \mid \tilde{X}_0 = \tilde{x}, \Gamma_0 = \gamma\right) \le \delta \quad \text{for all } n \in \mathbb{N},$$
where $M_{\varepsilon}(\tilde{x}, \gamma) = \inf\{k \ge 1 : \|\tilde{P}_{\gamma}^{k}(\tilde{x}, \cdot) - \tilde{\pi}_{\gamma}(\cdot)\|_{TV} \le \varepsilon\}$, and the diminishing adaptation condition
$$\sup_{\tilde{x} \in \tilde{\mathcal{X}}} \|\tilde{P}_{\Gamma_{n+1}}(\tilde{x}, \cdot) - \tilde{P}_{\Gamma_n}(\tilde{x}, \cdot)\|_{TV} \to 0,$$
Auxiliary Variable Adaptive MCMC algorithms are ergodic.

WLLN for Auxiliary Variable Adaptive MCMC
Under containment and diminishing adaptation, the Weak Law of Large Numbers holds for Auxiliary Variable Adaptive MCMC algorithms for bounded functions g, i.e.
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) \to \pi(g) \quad \text{in probability}.$$

Diminishing adaptation + containment

Diminishing adaptation: adapt increasingly rarely (both classes of algorithms) [Chimisov et al. (2018)].

Containment:

1. Drift condition: for some 0 < λ < 1 and b > 0,
$$P_{\gamma} V(x) \le \lambda V(x) + b \quad \text{for all } x \in \mathcal{X} \text{ and } \gamma \in \mathcal{Y},$$
where $P_{\gamma} V(x) = E_{\gamma}\!\left(V(X_{n+1}) \mid X_n = x\right)$;
$$\tilde{P}_{\gamma} V_{\tilde{\pi}_{\gamma}}(\tilde{x}) \le \lambda V_{\tilde{\pi}_{\gamma}}(\tilde{x}) + b \quad \text{for all } \tilde{x} \in \tilde{\mathcal{X}} \text{ and } \gamma \in \mathcal{Y}.$$

2. Minorisation condition: for some δ > 0 and v > 0,
$$P_{\gamma}(x, \cdot) \ge \delta \nu(\cdot) \quad \text{for } V(x) < v;$$
$$\tilde{P}_{\gamma}(\tilde{x}, \cdot) \ge \delta \nu_{\gamma}(\cdot) \quad \text{for } V_{\tilde{\pi}_{\gamma}}(\tilde{x}) < v.$$

Containment for Adaptive MCMC
If the drift condition and the minorisation condition are satisfied, then containment holds.

Containment for Auxiliary Variable Adaptive MCMC
If the drift condition and the minorisation condition are satisfied, and if the space of parameters $\mathcal{Y}$ is compact (+ technical conditions), then containment holds.

Typical choice: $V := \pi^{-1/2}$ or $V_{\tilde{\pi}_{\gamma}} := \tilde{\pi}_{\gamma}^{-1/2}$.

Ergodicity of JAMS

For π super-exponential:

1. For the independent jumps, if the tails of the proposal for jumps are heavier than the tails of π, JAMS is ergodic.

2. For the deterministic jumps, if Qi has polynomial tails, JAMS is ergodic.

In addition, the LLN holds for bounded functions.

A quick recap of JAMS¹

3 key ideas:
• Find the modes and then sample.
• Make local moves and jumps separately.
• Adapt the parameters on the fly, separately for each region.

Auxiliary Variable Adaptive MCMC class:
• Analogous results to (standard) Adaptive MCMC.
• Ergodicity of JAMS.
• Allows analysis of other adaptive MCMC algorithms in this class.

¹ Pompe, E., Holmes, C., and Łatuszyński, K. (2018). A Framework for Adaptive MCMC Targeting Multimodal Distributions. arXiv preprint arXiv:1812.02609.

References

Chimisov, C., Łatuszyński, K., and Roberts, G. (2018). Air Markov Chain Monte Carlo. arXiv preprint arXiv:1801.09309.

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood.

Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19(6):451.

Miasojedow, B., Moulines, E., and Vihola, M. (2013). An adaptive parallel tempering algorithm. Journal of Computational and Graphical Statistics, 22(3):649–664.

Roberts, G. and Rosenthal, J. (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458.

Woodard, D., Schmidler, S., and Huber, M. (2009). Sufficient conditions for torpid mixing of parallel and simulated tempering. Electronic Journal of Probability, 14:780–804.
