
A Framework for Adaptive MCMC Targeting Multimodal Distributions

Emilia Pompe, Department of Statistics, University of Oxford
Palermo, European Meeting of Statisticians 2019
Joint work with Chris Holmes (Oxford) and Krys Łatuszyński (Warwick)

Multimodality – why is this still an open problem?

How to sample efficiently from a multimodal distribution?

• Identify the regions of high probability.
• Move between these regions.
• Move efficiently within each mode.
• Adapt the parameters while the algorithm runs.
• Use parallel computations.

Tempering-based methods

e.g. simulated tempering [Marinari and Parisi (1992)], parallel tempering [Geyer (1991)]

Problems:
• Proven to mix poorly when π has both narrow and wide modes [Woodard et al. (2009)].
• No mechanism for remembering where the modes are.
• The same covariance matrix is used everywhere.

Outline

1. Multimodality: why still an open issue?
2. Jumping Adaptive Multimodal Sampler (JAMS)
3. Illustrations
4. Ergodicity of JAMS

JAMS: 3 key ideas

1. Find the modes and then sample!
2. Propose within-mode and between-mode moves separately.
3. Use different covariance matrices in different regions, and learn them while the algorithm runs.

Mode-finding

1. Sample n points (from the prior).
2. Run n minimisation procedures for −log(π) in parallel (BFGS, a quasi-Newton method).
3. Merge vectors that correspond to the same local maxima of π (using Hessians).
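The mode-finding step translates almost directly into code. The snippet below is a minimal sketch, not the authors' implementation: it assumes a user-supplied log-density `log_pi` and a prior sampler `prior_sample`, uses SciPy's BFGS, and merges optimisation runs by a simple distance threshold rather than the Hessian-based rule above; the optimisation runs could equally be dispatched in parallel.

```python
# A minimal sketch of the mode-finding step, assuming a generic log-density
# `log_pi`; SciPy's BFGS and a distance-based merging rule are illustrative
# choices, not the exact procedure from the paper.
import numpy as np
from scipy.optimize import minimize

def find_modes(log_pi, prior_sample, n_starts=100, merge_tol=1e-2):
    """Run n_starts BFGS minimisations of -log(pi) and merge near-duplicates."""
    starts = [prior_sample() for _ in range(n_starts)]          # 1. sample starting points
    results = [minimize(lambda x: -log_pi(x), x0, method="BFGS")
               for x0 in starts]                                 # 2. minimise -log(pi); can be run in parallel

    modes, hessian_invs = [], []
    for res in results:
        if not res.success:
            continue
        # 3. keep one representative per local maximum of pi
        if all(np.linalg.norm(res.x - m) > merge_tol for m in modes):
            modes.append(res.x)
            hessian_invs.append(res.hess_inv)   # BFGS inverse-Hessian estimate, reusable as an initial Sigma_i
    return modes, hessian_invs
```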

Target distribution π̃

Set of modes: I = {μ₁, . . . , μ_N}. Target distribution on an augmented state space 𝒳 × I:

$$\tilde{\pi}(x, i) = \pi(x)\,\frac{w_i\, Q_i(\mu_i, \Sigma_i)(x)}{\sum_{j \in I} w_j\, Q_j(\mu_j, \Sigma_j)(x)},$$

for $\sum_{i=1}^{N} w_i = 1$ and $Q_i(\mu_i, \Sigma_i)$ an elliptical distribution, e.g. Gaussian or multivariate t. Then π(x) is the marginal distribution of π̃ with respect to x:

$$\sum_{i=1}^{N} \pi(x)\,\frac{w_i\, Q_i(\mu_i, \Sigma_i)(x)}{\sum_{j \in I} w_j\, Q_j(\mu_j, \Sigma_j)(x)} = \pi(x).$$
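As a concrete illustration, the augmented target can be evaluated pointwise once the modes, covariances and weights are available. Below is a minimal sketch assuming Gaussian components Q_i; the function and argument names are placeholders, not from the paper.

```python
# Minimal sketch of evaluating log pi~(x, i) for Gaussian Q_i; `log_pi`,
# `modes`, `covs` and `weights` stand for quantities produced by the
# mode-finding and adaptation steps.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_pi_tilde(x, i, log_pi, modes, covs, weights):
    log_q = np.array([multivariate_normal.logpdf(x, mean=m, cov=c)
                      for m, c in zip(modes, covs)])       # log Q_j(mu_j, Sigma_j)(x)
    log_num = np.log(weights[i]) + log_q[i]                # log( w_i Q_i(x) )
    log_den = logsumexp(np.log(weights) + log_q)           # log( sum_j w_j Q_j(x) )
    return log_pi(x) + log_num - log_den
```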

Local and global moves

Suppose the current state of the chain is (x, i).

• Local move (with probability 1 − ε):
  • preserves the mode ((x, i) → (y, i));
  • a Random Walk Metropolis step with proposal distribution $\mathcal{N}\!\left(x, \tfrac{2.38^2}{d}\Sigma_i\right)$ (or the t distribution).
• Jump move (with probability ε):
  • a new mode k is proposed with probability $a_{ik}$ ((x, i) → (y, k));
  • two ways of proposing y:
    • deterministic jump: y such that $(y - \mu_k)^{T}\Sigma_k^{-1}(y - \mu_k) = (x - \mu_i)^{T}\Sigma_i^{-1}(x - \mu_i)$;
    • independence sampler: $y \sim \mathcal{N}(\mu_k, \Sigma_k)$.

The acceptance probabilities are chosen so that detailed balance holds for π̃.
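To make the deterministic jump concrete: one construction that preserves the Mahalanobis distance in the displayed equation is an affine map through Cholesky factors. The sketch below is an illustrative implementation of that idea, not necessarily the exact map used in JAMS.

```python
# One way to realise the deterministic jump (x, i) -> (y, k): an affine map
# through Cholesky factors, which preserves the Mahalanobis distance
# (y - mu_k)^T Sigma_k^{-1} (y - mu_k) = (x - mu_i)^T Sigma_i^{-1} (x - mu_i).
import numpy as np

def deterministic_jump(x, mu_i, sigma_i, mu_k, sigma_k):
    L_i = np.linalg.cholesky(sigma_i)      # Sigma_i = L_i L_i^T
    L_k = np.linalg.cholesky(sigma_k)      # Sigma_k = L_k L_k^T
    z = np.linalg.solve(L_i, x - mu_i)     # whiten x with respect to mode i
    return mu_k + L_k @ z                  # re-colour with respect to mode k
```

For such a deterministic proposal, the Metropolis–Hastings acceptance ratio would also involve the Jacobian of the map, det(L_k)/det(L_i) = (|Σ_k|/|Σ_i|)^{1/2}, together with the mode-proposal probabilities a_{ik} and a_{ki}, so that detailed balance with respect to π̃ is preserved.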

Updating the parameters

• Update the weights w_i so that they represent the observed proportions of samples in each mode.

• When a sample (x, i) appears, recalculate Σ_i.
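As an illustration of what this adaptation can look like in practice, here is a hedged sketch using running counts for the weights and a recursive empirical-covariance update per mode. It mirrors common adaptive-MCMC practice and is not claimed to be the exact JAMS update rule; the class and function names are invented for the example.

```python
# Generic sketch of per-mode parameter adaptation: running counts give the
# weights w_i, and a recursive empirical-covariance update refreshes Sigma_i
# from the samples assigned to mode i (x is assumed to be a NumPy array).
import numpy as np

class ModeParams:
    def __init__(self, mu0, d, eps=1e-6):
        self.count = 0
        self.mean = np.array(mu0, dtype=float)
        self.cov = np.eye(d)
        self.eps = eps                       # small jitter keeps cov positive definite

    def update(self, x):
        self.count += 1
        n = self.count
        delta = x - self.mean
        self.mean += delta / n
        # recursive update of the empirical covariance of samples in this mode
        self.cov = ((n - 1) / n) * self.cov + np.outer(delta, x - self.mean) / n
        self.cov += self.eps * np.eye(len(x)) / n

def weights_from_counts(params):
    counts = np.array([p.count for p in params], dtype=float)
    total = counts.sum()
    # w_i = observed proportion of samples assigned to mode i (uniform if no samples yet)
    return counts / total if total > 0 else np.full(len(counts), 1.0 / len(counts))
```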

Are the samples (x, i) collected from the region around μ_i?

Controlling the modes

Suppose we propose a local move (x, i) → (y, i).

Acceptance probability:

$$\min\left\{1, \frac{\tilde{\pi}(y, i)}{\tilde{\pi}(x, i)}\right\} = \min\left\{1,\; \frac{\pi(y)}{\pi(x)} \cdot \frac{Q_i(\mu_i, \Sigma_i)(y)}{Q_i(\mu_i, \Sigma_i)(x)} \cdot \frac{\sum_{j \in I} w_j Q_j(\mu_j, \Sigma_j)(x)}{\sum_{j \in I} w_j Q_j(\mu_j, \Sigma_j)(y)}\right\}.$$

If y lies far from μ_i, the factor Q_i(μ_i, Σ_i)(y) is tiny, so the ratio is tiny: (y, i) is unlikely to be accepted.

Illustrations

• Example studied by Woodard et al. (2009):
$$\pi(x) = \frac{1}{2}\,\mathcal{N}\!\Big(\underbrace{-(1, \ldots, 1)}_{d},\; \sigma_1^2 I_d\Big) + \frac{1}{2}\,\mathcal{N}\!\Big(\underbrace{(1, \ldots, 1)}_{d},\; \sigma_2^2 I_d\Big),$$
with $\sigma_1^2 = 0.5\sqrt{d/100}$ and $\sigma_2^2 = \sqrt{d/100}$ (up to dimension 200); a small density sketch follows after this list.

• Mixture of t-distributions and banana-shaped distributions (up to dimension 80).
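For the first example above, the target is easy to write down explicitly. Below is a minimal sketch of its log-density, with the variance scaling read off the slide, intended only as a test target for a sampler.

```python
# Log-density of the two-component Gaussian mixture from the Woodard et al.
# example above; sigma1^2 = 0.5*sqrt(d/100) and sigma2^2 = sqrt(d/100).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def make_log_pi(d):
    mu = np.ones(d)
    var1, var2 = 0.5 * np.sqrt(d / 100.0), np.sqrt(d / 100.0)
    comp1 = multivariate_normal(mean=-mu, cov=var1 * np.eye(d))
    comp2 = multivariate_normal(mean=mu, cov=var2 * np.eye(d))

    def log_pi(x):
        # equal-weight mixture: log(0.5 exp(l1) + 0.5 exp(l2))
        return logsumexp([comp1.logpdf(x), comp2.logpdf(x)]) + np.log(0.5)
    return log_pi
```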

Sensor network localisation (1)

11 sensors on [0, 1]², with the locations of 8 of them unknown.

• $x_i$ is observed from $x_j$ with probability $\exp\!\left(-\frac{\|x_i - x_j\|^2}{2 \times 0.3^2}\right)$.

• The observed distance $y_{ij}$ between two sensors follows
$$y_{ij} \mid w_{ij} = 1 \;\sim\; \mathcal{N}\big(\|x_i - x_j\|,\, 0.02^2\big).$$

• Uniform prior on x1,..., x8.
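The resulting posterior over the eight unknown locations is the multimodal target being sampled. Below is a hedged sketch of its log-density under the model just described; the matrices `w` (observation indicators) and `y` (observed distances), and the helper name `log_posterior`, are assumptions for illustration, and bookkeeping details may differ from the paper.

```python
# Hedged sketch of the sensor-localisation log-posterior: `w` (0/1 observation
# indicators) and `y` (observed distances) are assumed given as (11, 11)
# matrices, with the known sensor locations stacked after the unknown ones.
import numpy as np

def log_posterior(theta, w, y, known, n_unknown=8, R=0.3, tau=0.02):
    locs = np.vstack([theta.reshape(n_unknown, 2), known])   # unknown + known sensor positions
    if np.any(locs[:n_unknown] < 0) or np.any(locs[:n_unknown] > 1):
        return -np.inf                                        # uniform prior on [0, 1]^2
    logp = 0.0
    n = locs.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(locs[i] - locs[j])
            p_obs = np.exp(-dist**2 / (2 * R**2))             # detection probability
            if w[i, j] == 1:
                logp += np.log(p_obs) - 0.5 * ((y[i, j] - dist) / tau) ** 2
            else:
                logp += np.log1p(-p_obs)                      # pair not observed
    return logp
```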

JAMS versus Adaptive Parallel Tempering (APT) [Miasojedow et al. (2013)]:
• JAMS: 500,000 iterations, 10,000 mode-finding runs.
• APT: 500,000 iterations × 4 temperatures.

Sensor network localisation (2)

More simulations in arXiv:1812.02609.

Ergodicity of the algorithm

$$\tilde{\pi}(x, i) = \pi(x)\,\frac{w_i\, Q_i(\mu_i, \Sigma_i)(x)}{\sum_{j \in I} w_j\, Q_j(\mu_j, \Sigma_j)(x)},$$

When the weights and covariance matrices are adapted on the fly (indexed by γ), the target itself changes:

$$\tilde{\pi}_{\gamma}(x, i) = \pi(x)\,\frac{w_{\gamma,i}\, Q_i(\mu_i, \Sigma_{\gamma,i})(x)}{\sum_{j \in I} w_{\gamma,j}\, Q_j(\mu_j, \Sigma_{\gamma,j})(x)}.$$

Since the target π̃_γ depends on the adapted parameters, this is NOT a standard adaptive MCMC algorithm.

Auxiliary Variable Adaptive MCMC class

Augmented state space $\tilde{\mathcal{X}} = \mathcal{X} \times \Phi$.

1. $\lim_{n \to \infty} \|\tilde{P}_{\gamma}^{n}(\tilde{x}, \cdot) - \tilde{\pi}_{\gamma}(\cdot)\|_{TV} = 0$ for all $\tilde{x} := (x, \phi) \in \tilde{\mathcal{X}}$.

2. Distributions $\tilde{\pi}_{\gamma}$ are such that $\tilde{\pi}_{\gamma}(B \times \Phi) = \pi(B)$ for every $B \in \mathcal{B}(\mathcal{X})$ and $\gamma \in \mathcal{Y}$.

Convergence

$$\sup_{B \in \mathcal{B}(\mathcal{X})} \left| P\!\left(\tilde{X}_n \in (B \times \Phi) \mid \tilde{X}_0 = \tilde{x}, \Gamma_0 = \gamma\right) - \pi(B) \right| \to 0 \quad \text{as } n \to \infty.$$

Ergodicity of Adaptive MCMC

(a) (Diminishing adaptation.) The random variable
$$D_n := \sup_{x \in \mathcal{X}} \|P_{\Gamma_{n+1}}(x, \cdot) - P_{\Gamma_n}(x, \cdot)\|_{TV}$$
converges to 0 in probability.

(b) (Containment.) For all ε > 0 and all δ > 0, there exists N = N(ε, δ) such that
$$P\!\left(M_{\varepsilon}(X_n, \Gamma_n) > N \mid X_0 = x, \Gamma_0 = \gamma\right) \le \delta$$
for all $n \in \mathbb{N}$, where
$$M_{\varepsilon}(x, \gamma) = \inf\{k \ge 1 : \|P_{\gamma}^{k}(x, \cdot) - \pi(\cdot)\|_{TV} \le \varepsilon\}.$$

Ergodicity of Adaptive MCMC [Roberts and Rosenthal (2007)]
Under the containment and diminishing adaptation conditions, adaptive MCMC algorithms are ergodic.

Ergodicity of Auxiliary Variable Adaptive MCMC [P. et al. (2018)]
Under the containment condition
$$P\!\left(M_{\varepsilon}(\tilde{X}_n, \Gamma_n) > N \mid \tilde{X}_0 = \tilde{x}, \Gamma_0 = \gamma\right) \le \delta \quad \text{for all } n \in \mathbb{N},$$
where $M_{\varepsilon}(\tilde{x}, \gamma) = \inf\{k \ge 1 : \|\tilde{P}_{\gamma}^{k}(\tilde{x}, \cdot) - \tilde{\pi}_{\gamma}(\cdot)\|_{TV} \le \varepsilon\}$, and the diminishing adaptation condition
$$\sup_{\tilde{x} \in \tilde{\mathcal{X}}} \|\tilde{P}_{\Gamma_{n+1}}(\tilde{x}, \cdot) - \tilde{P}_{\Gamma_n}(\tilde{x}, \cdot)\|_{TV} \to 0,$$
Auxiliary Variable Adaptive MCMC algorithms are ergodic.

WLLN for Auxiliary Variable Adaptive MCMC
Under containment and diminishing adaptation, the Weak Law of Large Numbers holds for Auxiliary Variable Adaptive MCMC algorithms for bounded functions g, i.e.
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) \to \pi(g) \quad \text{in probability}.$$

Diminishing adaptation + containment

Diminishing adaptation: adapt increasingly rarely (both classes of algorithms) [Chimisov et al. (2018)].

Containment:

1. Drift condition: for some 0 < λ < 1 and b > 0,
$$P_{\gamma} V(x) \le \lambda V(x) + b \quad \text{for all } x \in \mathcal{X} \text{ and } \gamma \in \mathcal{Y},$$
where $P_{\gamma} V(x) = E_{\gamma}\!\left(V(X_{n+1}) \mid X_n = x\right)$;
$$\tilde{P}_{\gamma} V_{\tilde{\pi}_{\gamma}}(\tilde{x}) \le \lambda V_{\tilde{\pi}_{\gamma}}(\tilde{x}) + b \quad \text{for all } \tilde{x} \in \tilde{\mathcal{X}} \text{ and } \gamma \in \mathcal{Y}.$$

2. Minorisation condition: for some δ > 0 and v > 0,
$$P_{\gamma}(x, \cdot) \ge \delta \nu(\cdot) \quad \text{for } V(x) < v;$$
$$\tilde{P}_{\gamma}(\tilde{x}, \cdot) \ge \delta \nu_{\gamma}(\cdot) \quad \text{for } V_{\tilde{\pi}_{\gamma}}(\tilde{x}) < v.$$

Containment for Adaptive MCMC
If the drift condition and the minorisation condition are satisfied, then containment holds.

Containment for Auxiliary Variable Adaptive MCMC
If the drift condition and the minorisation condition are satisfied, and if the space of parameters $\mathcal{Y}$ is compact (+ technical conditions), then containment holds.

Typical choice: $V := \pi^{-1/2}$ or $V_{\tilde{\pi}_{\gamma}} := \tilde{\pi}_{\gamma}^{-1/2}$.

Ergodicity of JAMS

For π super-exponential:

1. For the independent jumps, if the tails of the proposal for jumps are heavier than the tails of π, JAMS is ergodic.

2. For the deterministic jumps, if Qi has polynomial tails, JAMS is ergodic.

In addition, the LLN holds for bounded functions.

A quick recap of JAMS¹

3 key ideas:
• Find the modes and then sample.
• Make local moves and jumps separately.
• Adapt the parameters on the fly, separately for each region.

Auxiliary Variable Adaptive MCMC class:
• Analogous results to (standard) Adaptive MCMC.
• Ergodicity of JAMS.
• Allows analysis of other adaptive MCMC algorithms in this class.

¹ Pompe, E., Holmes, C., and Łatuszyński, K. (2018). A Framework for Adaptive MCMC Targeting Multimodal Distributions. arXiv preprint arXiv:1812.02609.

References

Chimisov, C., Łatuszyński, K., and Roberts, G. (2018). Air Markov Chain Monte Carlo. arXiv preprint arXiv:1801.09309.

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood.

Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19(6):451.

Miasojedow, B., Moulines, E., and Vihola, M. (2013). An adaptive parallel tempering algorithm. Journal of Computational and Graphical Statistics, 22(3):649–664.

Roberts, G. and Rosenthal, J. (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458.

Woodard, D., Schmidler, S., and Huber, M. (2009). Sufficient conditions for torpid mixing of parallel and simulated tempering. Electronic Journal of Probability, 14:780–804.
