Part 3 – Bayesian System Identification

Learning nonlinear dynamics using sequential Monte Carlo
Thomas Schön, Uppsala University, 2020-03-11

Outline – Part 3

Aim: Show how SMC can be used to identify nonlinear SSMs using the Bayesian approach.

Outline:
1. Bayesian inference and the MCMC idea
2. The Metropolis–Hastings algorithm
3. Background on Bayesian system identification
4. Using unbiased estimates within Metropolis–Hastings
5. Exact approximation – particle Metropolis–Hastings (PMH)
6. Outlook (if there is time)

Bayesian inference – setup for now

Bayesian inference comes down to computing the target distribution \pi(x). More commonly, our interest lies in some integral of the form

E_\pi[\varphi(x) \mid y_{1:T}] = \int \varphi(x) \, p(x \mid y_{1:T}) \, dx.

Ex. (nonlinear dynamical systems): Here our interest is often x = \theta with \pi(\theta) = p(\theta \mid y_{1:T}), or x = (x_{1:T}, \theta) with \pi(x_{1:T}, \theta) = p(x_{1:T}, \theta \mid y_{1:T}). We keep the development general for now and specialize later.

How?

The two main strategies for the Bayesian inference problem:
1. Variational methods provide an approximation by assuming a certain functional form containing unknown parameters, which are found using optimization, where some distance measure is minimized.
2. Markov chain Monte Carlo (MCMC) works by simulating a Markov chain which is designed in such a way that its stationary distribution coincides with the target distribution.

MCMC and Metropolis–Hastings

Toy illustration – AR(1)

Let us play the game where you are asked to generate samples from \pi(x) = N(x \mid 0, 1/(1 - 0.8^2)). Consider one realisation of the Markov chain x[t+1] = 0.8 x[t] + v[t], where v[t] ~ N(0, 1), initialised at x[0] = -40. As t \to \infty, this chain generates samples from its stationary distribution

p_s(x) = N(x \mid 0, 1/(1 - 0.8^2)).

[Figure: one realisation of the chain over 500 time steps, recovering from the initialisation at x[0] = -40.]
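The drift of this toy chain into its stationary distribution is easy to reproduce; a minimal sketch (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) Markov chain: x[t+1] = 0.8 * x[t] + v[t], v[t] ~ N(0, 1).
# Started far from equilibrium at x[0] = -40, the chain forgets its
# initial state and settles into the stationary distribution
# N(0, 1 / (1 - 0.8**2)).
a, T, burn_in = 0.8, 200_000, 1_000
x = np.empty(T)
x[0] = -40.0
for t in range(T - 1):
    x[t + 1] = a * x[t] + rng.standard_normal()

samples = x[burn_in:]            # discard the burn-in phase
print(samples.mean())            # close to 0
print(samples.var())             # close to 1/(1 - 0.64) ≈ 2.78
```

The burn-in discard matters: without it, the long excursion from -40 would bias the empirical moments.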
Toy illustration – AR(1), continued

[Figure: the true stationary distribution (black) and the empirical histogram (gray) obtained by simulating the Markov chain x[t+1] = 0.8 x[t] + v[t], shown for 1 000 and for 100 000 samples.] The initial 1 000 samples are discarded (burn-in).

Metropolis–Hastings algorithm

A systematic method for constructing such Markov chains is provided by:
1. Sample a candidate x' from a proposal (akin to what we did in importance sampling),
   x' ~ q(x \mid x[m]).
2. Choose the candidate sample x' as the next state of the Markov chain with probability (for intuition: think about the importance weights)
   \alpha = \min\left(1, \frac{\pi(x')}{\pi(x[m])} \frac{q(x[m] \mid x')}{q(x' \mid x[m])}\right).
   Select the new state of the Markov chain according to
   x[m+1] = x' w.p. \alpha, and x[m+1] = x[m] w.p. 1 - \alpha.

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6): 1087–1092, 1953.
Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1): 97–109, 1970.

Metropolis–Hastings algorithm

Algorithm 1: Metropolis–Hastings (MH)
1. Initialize: set the initial state of the Markov chain x[1].
2. For m = 1 to M, iterate:
   a. Sample x' ~ q(x \mid x[m]).
   b. Sample u ~ U[0, 1].
   c. Compute the acceptance probability
      \alpha = \min\left(1, \frac{\pi(x')}{\pi(x[m])} \frac{q(x[m] \mid x')}{q(x' \mid x[m])}\right).
   d. Set the next state: x[m+1] = x' if u \le \alpha, otherwise x[m+1] = x[m].

Resulting empirical approximation of the posterior: \hat{\pi}(x) = \frac{1}{M} \sum_{m=1}^{M} \delta_{x[m]}(x).

Statistical properties of MCMC

The MCMC estimator
\hat{I}[\varphi] = \frac{1}{M} \sum_{m=1}^{M} \varphi(x[m])
is by the ergodic theorem known to be strongly consistent, i.e.
\frac{1}{M} \sum_{m=1}^{M} \varphi(x[m]) \xrightarrow{a.s.} \int \varphi(x) \, p(x \mid y_{1:T}) \, dx = I[\varphi]
when M \to \infty. There is also a central limit theorem (CLT) stating that
\sqrt{M} \left(\hat{I}[\varphi] - I[\varphi]\right) \xrightarrow{d} N(0, \sigma^2_{\mathrm{MCMC}})
as M \to \infty.
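Algorithm 1 can be sketched in a few lines. As an illustration (my own choices, not from the slides) I use a symmetric Gaussian random-walk proposal, so the q-ratio in \alpha cancels, and target the AR(1) stationary distribution from the toy example:

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_target, x0, n_iters, step=1.0):
    """Random-walk Metropolis: proposal q(x' | x) = N(x' | x, step**2).

    The proposal is symmetric, so the acceptance probability reduces
    to alpha = min(1, pi(x') / pi(x)), evaluated here in log-space.
    """
    x = x0
    chain = np.empty(n_iters)
    for m in range(n_iters):
        x_prop = x + step * rng.standard_normal()   # step a: propose
        u = rng.uniform()                           # step b
        # steps c-d: accept with probability min(1, pi(x')/pi(x))
        if np.log(u) < log_target(x_prop) - log_target(x):
            x = x_prop
        chain[m] = x
    return chain

# Target: N(0, 1/(1 - 0.8**2)), known only up to its exponent here --
# MH never needs the normalizing constant.
var = 1.0 / (1.0 - 0.8**2)
chain = metropolis_hastings(lambda x: -0.5 * x**2 / var,
                            x0=-40.0, n_iters=100_000)
samples = chain[1_000:]          # burn-in
print(samples.mean(), samples.var())
```

Note that only ratios of the target appear, which is exactly why an unnormalized target is enough.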
Using MH for Bayesian inference in dynamical systems

Recall the Bayesian problem formulation. Bayesian SSM representation using probability distributions:
x_t \mid x_{t-1}, \theta ~ p(x_t \mid x_{t-1}, \theta),
y_t \mid x_t, \theta ~ p(y_t \mid x_t, \theta),
x_0 ~ p(x_0 \mid \theta),
\theta ~ p(\theta).

Based on our generative model, compute the posterior distribution
p(x_{0:T}, \theta \mid y_{1:T}) = p(x_{0:T} \mid \theta, y_{1:T}) \, p(\theta \mid y_{1:T}),
where the first factor is the state inference problem and the second the parameter inference problem.

Bayesian formulation – model the unknown parameters as a random variable \theta ~ p(\theta) and compute
p(\theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid \theta) p(\theta)}{p(y_{1:T})} = \frac{p(y_{1:T} \mid \theta) p(\theta)}{\int p(y_{1:T} \mid \theta) p(\theta) \, d\theta}.

Using MH for parameter inference in a dynamical system

Algorithm 2: Metropolis–Hastings (MH)
1. Initialize: set the initial state of the Markov chain \theta[1].
2. For m = 1 to M, iterate:
   a. Sample \theta' ~ q(\theta \mid \theta[m]).
   b. Sample u ~ U[0, 1].
   c. Compute the acceptance probability
      \alpha = \min\left(1, \frac{p(y_{1:T} \mid \theta') p(\theta')}{p(y_{1:T} \mid \theta[m]) p(\theta[m])} \frac{q(\theta[m] \mid \theta')}{q(\theta' \mid \theta[m])}\right).
   d. Set the next state: \theta[m+1] = \theta' if u \le \alpha, otherwise \theta[m+1] = \theta[m].

Setting up an MH algorithm

To be able to use MH we need to
1. decide on a proposal q to use, and
2. compute the acceptance probability \alpha.

Important question

Problem: We cannot evaluate the acceptance probability \alpha, since the likelihood p(y_{1:T} \mid \theta) is intractable. We know that SMC provides an estimate of the likelihood.

Important question: Is it possible to use an estimate of the likelihood in computing the acceptance probability and still end up with a valid algorithm? Valid here means that the method converges in the sense of
\frac{1}{M} \sum_{m=1}^{M} \varphi(\theta[m]) \xrightarrow{a.s.} \int \varphi(\theta) \, p(\theta \mid y_{1:T}) \, d\theta, when M \to \infty.

Particle Metropolis–Hastings

The particle filter as a likelihood estimator

Fact: The particle filter provides a
• non-negative
• and unbiased
estimate \hat{z} of the likelihood p(y_{1:T} \mid \theta).
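A minimal bootstrap particle filter returning such a log-likelihood estimate can be sketched as follows. The scalar SSM, the noise variances, and all names are my own illustrative choices, not a model from the slides:

```python
import numpy as np

def bootstrap_pf_loglik(y, theta, n_particles, rng):
    """Bootstrap particle filter for the toy scalar SSM

        x[t] = theta * x[t-1] + v[t],   v[t] ~ N(0, 1),   x[0] ~ N(0, 1),
        y[t] = x[t] + e[t],             e[t] ~ N(0, 0.5**2),

    returning an estimate of log p(y[1:T] | theta).  Its exp() is the
    non-negative, unbiased likelihood estimate discussed above.
    """
    x = rng.standard_normal(n_particles)                # x[0] ~ N(0, 1)
    loglik = 0.0
    for yt in y:
        x = theta * x + rng.standard_normal(n_particles)   # propagate
        logw = -0.5 * ((yt - x) / 0.5) ** 2                # N(yt; x, 0.25)
        c = logw.max()                                     # for stability
        w = np.exp(logw - c)
        # the average unnormalized weight estimates p(y[t] | y[1:t-1])
        loglik += c + np.log(w.mean()) - 0.5 * np.log(2 * np.pi * 0.25)
        x = rng.choice(x, size=n_particles, p=w / w.sum())  # resample
    return loglik

# Simulate T = 100 observations from the model with theta = 0.8.
rng = np.random.default_rng(2)
x_true, ys = 0.0, []
for _ in range(100):
    x_true = 0.8 * x_true + rng.standard_normal()
    ys.append(x_true + 0.5 * rng.standard_normal())
ys = np.asarray(ys)

# Two independent runs give two different draws of the estimator:
ll1 = bootstrap_pf_loglik(ys, 0.8, 500, np.random.default_rng(10))
ll2 = bootstrap_pf_loglik(ys, 0.8, 500, np.random.default_rng(11))
print(ll1, ll2)   # close, but not identical -- the estimate is random
```

Running it twice with different seeds makes the slide's point concrete: the estimator is itself a random variable, and we only ever sample from its distribution.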
This likelihood estimator \hat{z} is itself a random variable, distributed according to
\hat{z} ~ \psi(z \mid \theta, y_{1:T}).
This is a very complicated distribution but, importantly, we will (as we will see) never be required to evaluate it, only to sample from it.

Auxiliary variables – a very useful construction

Target distribution: \pi(x), difficult to sample from.
Idea: Introduce another variable u with conditional distribution \pi(u \mid x). The joint distribution \pi(x, u) = \pi(u \mid x) \pi(x) admits \pi(x) as a marginal by construction, i.e., \int \pi(x, u) \, du = \pi(x). Sampling from the joint \pi(x, u) may be easier than directly sampling from the marginal \pi(x)!

The variable u is an auxiliary variable. It may have some "physical" interpretation (an unobserved measurement, an unknown temperature, ...) but this is not necessary.

What about introducing \hat{z} as an auxiliary variable?

Consider an extended model where z is included as an auxiliary variable:
(\theta, z) ~ \psi(\theta, z \mid y_{1:T}) = \psi(z \mid \theta, y_{1:T}) \, p(\theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid \theta) p(\theta) \, \psi(z \mid \theta, y_{1:T})}{p(y_{1:T})}.
Importantly, we note that the original target distribution p(\theta \mid y_{1:T}) is by construction obtained by marginalizing \psi(\theta, z \mid y_{1:T}) w.r.t. z.

Key question: If we now were to construct a Metropolis–Hastings algorithm for \theta and z, have we solved the problem?

Trick – defining a new extended target distribution

Enabling trick: Define a new joint target distribution over (\theta, z) by simply replacing p(y_{1:T} \mid \theta) with its estimator z. Hence, our new target distribution is given by
\pi(\theta, z \mid y_{1:T}) = \frac{z \, p(\theta) \, \psi(z \mid \theta, y_{1:T})}{p(y_{1:T})}.
Key question: Is this ok?

Verifying that our new extended target is indeed ok

Requirements on \pi:
1. Non-negative.
2. Integrates to 1.
3. Correct marginal distribution: \int \pi(\theta, z \mid y_{1:T}) \, dz = p(\theta \mid y_{1:T}).

Requirement 1 follows from the non-negativity of \hat{z}.

What about requirements 2 and 3? Let us start by noting that
\int \pi(\theta, z \mid y_{1:T}) \, dz = \frac{p(\theta)}{p(y_{1:T})} \int z \, \psi(z \mid \theta, y_{1:T}) \, dz.
What can we say about this integral?
The fact that the likelihood estimate from the particle filter is unbiased means that
\int z \, \psi(z \mid \theta, y_{1:T}) \, dz = p(y_{1:T} \mid \theta).
Hence, we have shown that
\int \pi(\theta, z \mid y_{1:T}) \, dz = \frac{p(\theta)}{p(y_{1:T})} \, p(y_{1:T} \mid \theta) = p(\theta \mid y_{1:T}),
which means that requirement 3 is satisfied; and since the marginal p(\theta \mid y_{1:T}) integrates to 1, requirement 2 follows as well.
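This "exact approximation" property can be checked numerically even without a particle filter: perturb an exact likelihood by mean-one lognormal noise, a stand-in for the unbiased estimate \hat{z}, and run MH on the extended state (\theta, z). The toy model and all names below are my own; despite the noisy likelihood, the chain should still recover the exact posterior:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model with a tractable answer, so the claim can be verified:
# y_i ~ N(theta, 1) i.i.d., prior theta ~ N(0, 1).  The exact posterior
# is N(n*ybar/(n + 1), 1/(n + 1)).
y = rng.normal(0.5, 1.0, size=50)
n, ybar = y.size, y.mean()

def exact_loglik(theta):
    return -0.5 * np.sum((y - theta) ** 2)      # up to a constant

def noisy_loglik(theta, s=0.5):
    """Unbiased on the likelihood scale: z-hat = z * W, where
    W ~ LogNormal(-s**2/2, s**2) so that E[W] = 1."""
    return exact_loglik(theta) + rng.normal(-0.5 * s**2, s)

def log_prior(theta):
    return -0.5 * theta**2

# Pseudo-marginal MH: the estimate for the current state is stored and
# reused -- it is part of the extended state (theta, z), never recomputed.
theta, logz = 0.0, noisy_loglik(0.0)
chain = np.empty(50_000)
for m in range(chain.size):
    theta_p = theta + 0.3 * rng.standard_normal()
    logz_p = noisy_loglik(theta_p)               # fresh estimate z'
    log_alpha = (logz_p + log_prior(theta_p)) - (logz + log_prior(theta))
    if np.log(rng.uniform()) < log_alpha:
        theta, logz = theta_p, logz_p
    chain[m] = theta

post_mean = n * ybar / (n + 1)                   # exact posterior mean
print(chain[5_000:].mean(), post_mean)           # should agree closely
```

The one design point worth stressing: reusing the stored logz for the current state is what makes the extended chain target \pi(\theta, z \mid y_{1:T}); recomputing it at every iteration would break the argument above.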
