A Framework for Adaptive MCMC Targeting Multimodal Distributions
Emilia Pompe, Department of Statistics, University of Oxford
European Meeting of Statisticians 2019, Palermo
Joint work with Chris Holmes (Oxford) and Krys Łatuszyński (Warwick)

Multimodality – why is this still an open problem?
How to sample efficiently from a multimodal distribution?
• Identify the regions of high probability.
• Move between these regions.
• Move efficiently within each mode.
• Adapt the parameters while the algorithm runs.
• Use parallel computations.
Tempering-based methods

E.g. simulated tempering [Marinari and Parisi (1992)], parallel tempering [Geyer (1991)].

Problems:
• Proven to mix poorly when π has both narrow and wide modes [Woodard et al. (2009)].
• No mechanism for remembering where the modes are.
• The same covariance matrix is used everywhere.
Outline
1. Multimodality – why is this still an open problem?
2. Jumping Adaptive Multimodal Sampler (JAMS)
3. Illustrations
4. Ergodicity of JAMS
JAMS: 3 key ideas
1. Find the modes and then sample!
2. Propose within-mode and between-mode moves separately.
3. Use different covariance matrices in different regions, and learn them while the algorithm runs.
Mode-finding
1. Sample n points (from the prior).
2. Run n minimisation procedures for −log(π) in parallel (BFGS, a quasi-Newton method).
3. Merge vectors that correspond to the same local maxima of π (using Hessians).
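The mode-finding loop can be sketched as follows: a minimal 1-D stand-in with a hypothetical bimodal target, where plain gradient descent on −log(π) replaces the BFGS runs (in practice one would call, e.g., scipy.optimize.minimize with method='BFGS') and a merge-by-distance step replaces the Hessian-based merging.

```python
import math
import random

# Hypothetical 1-D bimodal target: equal-weight mixture of N(-3, 1) and N(3, 1).
def log_pi(x):
    return math.log(math.exp(-0.5 * (x + 3.0) ** 2)
                    + math.exp(-0.5 * (x - 3.0) ** 2))

def grad_neg_log_pi(x, h=1e-5):
    # Central-difference gradient of -log(pi).
    return -(log_pi(x + h) - log_pi(x - h)) / (2.0 * h)

def find_modes(n_starts=20, steps=500, lr=0.05, merge_tol=0.5, seed=1):
    """Multi-start local minimisation of -log(pi), then merge end points
    that correspond to the same mode."""
    random.seed(seed)
    ends = []
    for _ in range(n_starts):
        x = random.uniform(-6.0, 6.0)      # step 1: sample a starting point
        for _ in range(steps):             # step 2: minimise -log(pi) locally
            x -= lr * grad_neg_log_pi(x)
        ends.append(x)
    modes = []                             # step 3: merge duplicate optima
    for x in sorted(ends):
        if not modes or abs(x - modes[-1]) > merge_tol:
            modes.append(x)
    return modes
```

On a real target the n minimisation runs would be launched in parallel, and the merging step would compare Hessians at the located optima rather than raw distances.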
Target distribution π̃

Set of modes: I = {µ₁, …, µ_N}. Target distribution on the augmented state space X × I:

\tilde\pi(x, i) = \pi(x)\, \frac{w_i\, Q_i(\mu_i, \Sigma_i)(x)}{\sum_{j \in I} w_j\, Q_j(\mu_j, \Sigma_j)(x)},

for \sum_{i=1}^{N} w_i = 1 and Q_i(\mu_i, \Sigma_i) an elliptical distribution, e.g. Gaussian or multivariate t. π(x) is the marginal distribution of π̃ with respect to x:

\sum_{i=1}^{N} \pi(x)\, \frac{w_i\, Q_i(\mu_i, \Sigma_i)(x)}{\sum_{j \in I} w_j\, Q_j(\mu_j, \Sigma_j)(x)} = \pi(x).
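The marginalisation identity above is easy to check numerically. A minimal 1-D sketch, with hypothetical modes µ = {−3, 3}, unit scales, equal weights, and Gaussian Q_i:

```python
import math

mu    = [-3.0, 3.0]   # hypothetical mode locations
sigma = [1.0, 1.0]    # hypothetical scales (Sigma_i in 1-D)
w     = [0.5, 0.5]    # weights, summing to 1

def q(i, x):
    """Component Q_i(mu_i, Sigma_i)(x): a 1-D Gaussian density."""
    return (math.exp(-0.5 * ((x - mu[i]) / sigma[i]) ** 2)
            / (sigma[i] * math.sqrt(2.0 * math.pi)))

def pi(x):
    # Some target density; here itself a two-component mixture.
    return 0.5 * q(0, x) + 0.5 * q(1, x)

def pi_tilde(x, i):
    """Augmented target on X x I."""
    denom = sum(w[j] * q(j, x) for j in range(len(mu)))
    return pi(x) * w[i] * q(i, x) / denom
```

Summing pi_tilde(x, i) over i recovers pi(x) for every x, which is exactly the displayed identity.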
Local and global moves

Suppose the current state of the chain is (x, i).

• Local move (with probability 1 − ε):
  – preserves the mode ((x, i) → (y, i));
  – a Random Walk Metropolis step with proposal distribution N(x, (2.38²/d) Σᵢ) (or the t distribution).
• Jump move (with probability ε):
  – a new mode k is proposed with probability a_{ik} ((x, i) → (y, k));
  – two ways of proposing y:
    – deterministic jump: y such that (y − µ_k)ᵀ Σ_k⁻¹ (y − µ_k) = (x − µ_i)ᵀ Σ_i⁻¹ (x − µ_i);
    – independence sampler: y ∼ N(µ_k, Σ_k).

The acceptance probabilities are such that detailed balance holds for π̃.
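The deterministic jump can be written as an explicit map. In the 1-D sketch below (hypothetical modes and scales) it preserves the Mahalanobis distance exactly; in d dimensions the analogous map is y = µ_k + L_k L_i⁻¹ (x − µ_i) with Σ = LLᵀ the Cholesky factorisation.

```python
def deterministic_jump(x, i, k, mu, sigma):
    """Jump from mode i to mode k preserving the (1-D) Mahalanobis
    distance: ((y - mu[k]) / sigma[k])**2 == ((x - mu[i]) / sigma[i])**2.
    In d dimensions: y = mu_k + L_k @ inv(L_i) @ (x - mu_i), Sigma = L L^T.
    """
    return mu[k] + (sigma[k] / sigma[i]) * (x - mu[i])

# Hypothetical modes/scales: a point one sigma to the right of mode 0
# maps to a point one sigma to the right of mode 1.
mu, sigma = [-3.0, 3.0], [1.0, 2.0]
y = deterministic_jump(-2.0, 0, 1, mu, sigma)  # -> 5.0
```

Because the relative position within the proposed mode matches the one in the current mode, the jump lands in a region of comparable local density, which keeps its acceptance probability away from zero.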
Updating the parameters

• Update the weights wᵢ so that they represent the proportions of samples observed in each mode.
• When a sample (x, i) appears, recalculate Σᵢ.
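A minimal 1-D sketch of these running updates, with hypothetical names (a full implementation keeps a d-dimensional running mean and a d×d covariance per mode): Welford's online algorithm recalculates the variance whenever a new sample (x, i) arrives, and the weights are the observed mode proportions.

```python
class ModeStats:
    """Running mean/variance for one mode (1-D analogue of Sigma_i)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Welford's online update: O(1) work per new sample.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def var(self):
        return self.m2 / self.n if self.n > 1 else 1.0

def weights(counts):
    """w_i = proportion of samples observed in mode i."""
    total = sum(counts)
    return [c / total for c in counts]
```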
Are the samples (x, i) collected from the region around µi ?
Controlling the modes

Suppose we propose a local move (x, i) → (y, i). Acceptance probability:

\min\left(1, \frac{\tilde\pi(y, i)}{\tilde\pi(x, i)}\right) = \min\left(1, \frac{\pi(y)\, \overbrace{Q_i(\mu_i, \Sigma_i)(y)}^{\text{tiny}}\, \sum_{j \in I} w_j Q_j(\mu_j, \Sigma_j)(x)}{\pi(x)\, Q_i(\mu_i, \Sigma_i)(x)\, \sum_{j \in I} w_j Q_j(\mu_j, \Sigma_j)(y)}\right)

If y is far from µᵢ, the factor Qᵢ(µᵢ, Σᵢ)(y) is tiny, so (y, i) is unlikely to be accepted.
Illustrations

• Example studied by Woodard et al. (2009):

\pi(x) = \frac{1}{2}\, N\!\big(-(1, \dots, 1),\ \sigma_1^2 I_d\big) + \frac{1}{2}\, N\!\big((1, \dots, 1),\ \sigma_2^2 I_d\big),

where (1, …, 1) ∈ ℝ^d, σ₁² = 0.5√(d/100) and σ₂² = √(d/100) (up to dimension 200).

• Mixture of t-distributions and banana-shaped distributions (up to dimension 80).
Sensor network localisation (1)

11 sensors on [0, 1]², with the locations of 8 of them unknown.

• Sensor xᵢ is observed from xⱼ with probability exp(−‖xᵢ − xⱼ‖² / (2 × 0.3²)).
• The observed distance yij between two sensors follows