Part 3 – Bayesian System Identification
Learning nonlinear dynamics using sequential Monte Carlo
Thomas Schön, Uppsala University, 2020-03-11

Outline – Part 3

Aim: Show how SMC can be used to identify nonlinear SSMs using a Bayesian approach.

Outline:
1. Bayesian inference and the MCMC idea
2. The Metropolis-Hastings algorithm
3. Background on Bayesian system identification
4. Using unbiased estimates within Metropolis-Hastings
5. Exact approximation – particle Metropolis-Hastings (PMH)
6. Outlook (if there is time)

Bayesian inference – setup for now

Bayesian inference comes down to computing the target distribution $\pi(x)$. More commonly, our interest lies in some integral of the form
\[
\mathbb{E}_\pi[\varphi(x) \mid y_{1:T}] = \int \varphi(x)\, p(x \mid y_{1:T})\, \mathrm{d}x.
\]

Ex. (nonlinear dynamical systems): Here our interest is often $x = \theta$ and $\pi(\theta) = p(\theta \mid y_{1:T})$, or $x = (x_{1:T}, \theta)$ and $\pi(x_{1:T}, \theta) = p(x_{1:T}, \theta \mid y_{1:T})$.

We keep the development general for now and specialize later.

How?

The two main strategies for the Bayesian inference problem:

1. Variational methods provide an approximation by assuming a certain functional form containing unknown parameters, which are found using optimization, where some distance measure is minimized.
2. Markov chain Monte Carlo (MCMC) works by simulating a Markov chain which is designed in such a way that its stationary distribution coincides with the target distribution.

MCMC and Metropolis-Hastings

Toy illustration – AR(1)

Let us play the game where you are asked to generate samples from $\pi(x) = \mathcal{N}(x \mid 0,\, 1/(1 - 0.8^2))$.

Simulate one realisation of $x[t+1] = 0.8\, x[t] + v[t]$, where $v[t] \sim \mathcal{N}(0, 1)$, initialised in $x[0] = -40$.

(Figure: one realisation of the chain over 500 time steps, starting at $-40$ and quickly settling around 0.)

This will eventually generate samples from the following stationary distribution:
\[
p_s(x) = \mathcal{N}\left(x \mid 0,\, 1/(1 - 0.8^2)\right)
\]
as $t \to \infty$.

Toy illustration – AR(1)

(Figure: histograms based on 1 000 and 100 000 samples, respectively.) The true stationary distribution is shown in black and the empirical histogram obtained by simulating the Markov chain $x[t+1] = 0.8\, x[t] + v[t]$ is plotted in gray. The initial 1 000 samples are discarded (burn-in).

Metropolis-Hastings algorithm

A systematic method for constructing such Markov chains is provided by:

1. Sample a candidate $x'$ from a proposal (akin to what we did in importance sampling),
\[
x' \sim q(x \mid x[m]).
\]
2. Choose the candidate sample $x'$ as the next state of the Markov chain with probability (for intuition: think about the importance weights)
\[
\alpha = \min\left\{1,\ \frac{\pi(x')}{\pi(x[m])} \frac{q(x[m] \mid x')}{q(x' \mid x[m])}\right\}.
\]

Select the new state of the Markov chain according to
\[
x[m+1] =
\begin{cases}
x' & \text{w.p. } \alpha,\\
x[m] & \text{w.p. } 1 - \alpha.
\end{cases}
\]

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6): 1087–1092, 1953.

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1): 97–109, 1970.

Metropolis-Hastings algorithm

Algorithm 1: Metropolis-Hastings (MH)
1. Initialize: Set the initial state of the Markov chain $x[1]$.
2. For $m = 1$ to $M$, iterate:
   a. Sample $x' \sim q(x \mid x[m])$.
   b. Sample $u \sim \mathcal{U}[0, 1]$.
   c. Compute the acceptance probability
   \[
   \alpha = \min\left\{1,\ \frac{\pi(x')}{\pi(x[m])} \frac{q(x[m] \mid x')}{q(x' \mid x[m])}\right\}.
   \]
   d. Set the next state $x[m+1]$ of the Markov chain according to
   \[
   x[m+1] =
   \begin{cases}
   x' & u \leq \alpha,\\
   x[m] & \text{otherwise}.
   \end{cases}
   \]

Resulting empirical approximation of the posterior: $\widehat{\pi}(x) = \frac{1}{M} \sum_{m=1}^{M} \delta_{x[m]}(x)$.
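The AR(1) toy example above is easy to reproduce. A minimal sketch, assuming NumPy; the seed, chain length, and burn-in are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100_000
x = np.empty(T)
x[0] = -40.0                      # deliberately poor initialisation
for t in range(T - 1):
    x[t + 1] = 0.8 * x[t] + rng.standard_normal()   # x[t+1] = 0.8 x[t] + v[t]

samples = x[1000:]                # discard burn-in
# Stationary distribution is N(0, 1/(1 - 0.8^2)); compare the moments:
print(samples.mean(), samples.var(), 1 / (1 - 0.8**2))
```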
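Algorithm 1 is short enough to state as code. A minimal sketch, assuming NumPy; the Gaussian random-walk proposal and the choice of target (the same Gaussian as in the toy example) are illustrative assumptions, not prescribed by the slides. Working with log densities avoids numerical under- and overflow:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # log pi(x) up to an additive constant; here pi = N(0, 1/(1 - 0.8^2))
    return -0.5 * x**2 * (1 - 0.8**2)

def metropolis_hastings(log_target, x_init, M, step=1.0):
    """Random-walk MH: the symmetric proposal q(x' | x) = N(x, step^2)
    makes the q-ratio in the acceptance probability cancel."""
    x = np.empty(M)
    x[0] = x_init
    lp = log_target(x_init)
    for m in range(M - 1):
        x_prop = x[m] + step * rng.standard_normal()
        lp_prop = log_target(x_prop)
        # accept with probability min{1, pi(x')/pi(x[m])}, in the log domain
        if np.log(rng.uniform()) <= lp_prop - lp:
            x[m + 1], lp = x_prop, lp_prop
        else:
            x[m + 1] = x[m]
    return x

chain = metropolis_hastings(log_target, x_init=-40.0, M=100_000)
print(chain[1000:].mean(), chain[1000:].var())  # ~0 and ~1/(1 - 0.8^2)
```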
Statistical properties of MCMC

The MCMC estimator
\[
\widehat{I}[\varphi] = \frac{1}{M} \sum_{m=1}^{M} \varphi(x[m])
\]
is by the ergodic theorem known to be strongly consistent, i.e.
\[
\underbrace{\frac{1}{M} \sum_{m=1}^{M} \varphi(x[m])}_{\widehat{I}[\varphi]} \xrightarrow{\text{a.s.}} \underbrace{\int \varphi(x)\, p(x \mid y_{1:T})\, \mathrm{d}x}_{I[\varphi]}
\]
when $M \to \infty$.

There is also a central limit theorem (CLT) stating that
\[
\sqrt{M}\left(\widehat{I}[\varphi] - I[\varphi]\right) \xrightarrow{d} \mathcal{N}\left(0, \sigma^2_{\text{MCMC}}\right)
\]
when $M \to \infty$.

Using MH for Bayesian inference in dynamical systems

Recall the Bayesian problem formulation. The Bayesian SSM representation using probability distributions:
\[
\begin{aligned}
x_t \mid x_{t-1}, \theta &\sim p(x_t \mid x_{t-1}, \theta),\\
y_t \mid x_t, \theta &\sim p(y_t \mid x_t, \theta),\\
x_0 &\sim p(x_0 \mid \theta),\\
\theta &\sim p(\theta).
\end{aligned}
\]

Based on our generative model, compute the posterior distribution
\[
p(x_{0:T}, \theta \mid y_{1:T}) = \underbrace{p(x_{0:T} \mid \theta, y_{1:T})}_{\text{state inf.}}\, \underbrace{p(\theta \mid y_{1:T})}_{\text{param. inf.}}.
\]

Bayesian formulation – model the unknown parameters as a random variable $\theta \sim p(\theta)$ and compute
\[
p(\theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid \theta)\, p(\theta)}{p(y_{1:T})} = \frac{p(y_{1:T} \mid \theta)\, p(\theta)}{\int p(y_{1:T} \mid \theta)\, p(\theta)\, \mathrm{d}\theta}.
\]

Using MH for parameter inference in a dynamical system

Algorithm 2: Metropolis-Hastings (MH)
1. Initialize: Set the initial state of the Markov chain $\theta[1]$.
2. For $m = 1$ to $M$, iterate:
   a. Sample $\theta' \sim q(\theta \mid \theta[m])$.
   b. Sample $u \sim \mathcal{U}[0, 1]$.
   c. Compute the acceptance probability
   \[
   \alpha = \min\left\{1,\ \frac{p(y_{1:T} \mid \theta')\, p(\theta')}{p(y_{1:T} \mid \theta[m])\, p(\theta[m])} \frac{q(\theta[m] \mid \theta')}{q(\theta' \mid \theta[m])}\right\}.
   \]
   d. Set the next state $\theta[m+1]$ of the Markov chain according to
   \[
   \theta[m+1] =
   \begin{cases}
   \theta' & u \leq \alpha,\\
   \theta[m] & \text{otherwise}.
   \end{cases}
   \]

Setting up an MH algorithm

To be able to use MH we need to
1. decide on a proposal $q$ to use, and
2. compute the acceptance probability $\alpha$.

Important question

Problem: We cannot evaluate the acceptance probability $\alpha$, since the likelihood $p(y_{1:T} \mid \theta)$ is intractable.

We know that SMC provides an estimate of the likelihood.

Important question: Is it possible to use an estimate of the likelihood in computing the acceptance probability and still end up with a valid algorithm?

Valid here means that the method converges in the sense of
\[
\frac{1}{M} \sum_{m=1}^{M} \varphi(\theta[m]) \xrightarrow{\text{a.s.}} \int \varphi(\theta)\, p(\theta \mid y_{1:T})\, \mathrm{d}\theta, \quad \text{when } M \to \infty.
\]

Particle Metropolis-Hastings

The particle filter as a likelihood estimator

Fact: The particle filter provides a
• non-negative
• and unbiased
estimate $\widehat{z}$ of the likelihood $p(y_{1:T} \mid \theta)$.

This likelihood estimator $\widehat{z}$ is itself a random variable, distributed according to
\[
\widehat{z} \sim \psi(z \mid \theta, y_{1:T}).
\]

This is a very complicated distribution, but importantly we will (as we will see) never be required to evaluate it, only to sample from it.

Auxiliary variables – a very useful construction

Target distribution: $\pi(x)$, difficult to sample from.

Idea: Introduce another variable $u$ with conditional distribution $\pi(u \mid x)$. The joint distribution $\pi(x, u) = \pi(u \mid x)\, \pi(x)$ admits $\pi(x)$ as a marginal by construction, i.e. $\int \pi(x, u)\, \mathrm{d}u = \pi(x)$.

Sampling from the joint $\pi(x, u)$ may be easier than directly sampling from the marginal $\pi(x)$!

The variable $u$ is an auxiliary variable. It may have some "physical" interpretation (an unobserved measurement, an unknown temperature, ...) but this is not necessary.

What about introducing $\widehat{z}$ as an auxiliary variable?

Consider an extended model where $z$ is included as an auxiliary variable,
\[
(\theta, z) \sim \phi(\theta, z \mid y_{1:T}) = \psi(z \mid \theta, y_{1:T})\, p(\theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid \theta)\, p(\theta)\, \psi(z \mid \theta, y_{1:T})}{p(y_{1:T})}.
\]

Importantly, we note that the original target distribution $p(\theta \mid y_{1:T})$ is by construction obtained by marginalizing $\phi(\theta, z \mid y_{1:T})$ w.r.t. $z$.

Key question: If we now were to construct a Metropolis-Hastings algorithm for $\theta$ and $z$, have we solved the problem?
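For concreteness, here is how a bootstrap particle filter produces the estimate $\widehat{z}$: the product over time of the averaged importance weights estimates the likelihood, and we accumulate its logarithm for numerical stability. This is a sketch, assuming NumPy; `particle_loglik`, `propagate`, `log_obs_dens`, and `sample_x0` are hypothetical placeholder names for a model-specific implementation, and multinomial resampling at every step is assumed (the setting in which the unbiasedness result is usually stated):

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_loglik(y, theta, N, propagate, log_obs_dens, sample_x0):
    """Bootstrap particle filter estimate of log p(y_{1:T} | theta).
    exp of the returned value is the non-negative, unbiased estimate z-hat."""
    T = len(y)
    x = sample_x0(theta, N)                    # N samples from p(x0 | theta)
    log_z = 0.0
    for t in range(T):
        x = propagate(x, theta)                # sample p(x_t | x_{t-1}, theta)
        logw = log_obs_dens(y[t], x, theta)    # log p(y_t | x_t, theta)
        c = logw.max()                         # log-sum-exp stabilisation
        w = np.exp(logw - c)
        log_z += c + np.log(w.mean())          # log of the t-th likelihood factor
        idx = rng.choice(N, size=N, p=w / w.sum())   # multinomial resampling
        x = x[idx]
    return log_z
```

Note the asymmetry that matters below: the unbiasedness is in $\widehat{z}$ itself, not in $\log \widehat{z}$.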
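A quick numerical illustration of the unbiasedness claim (a sanity check, not a proof). The linear-Gaussian model and all numbers below are illustrative assumptions; it reuses `particle_loglik` from the sketch above, and a single large-$N$ run stands in as a proxy for the exact likelihood (which, for this particular model, a Kalman filter could also compute exactly):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative linear-Gaussian SSM: x_t = theta x_{t-1} + v_t, y_t = x_t + e_t
def propagate(x, theta):
    return theta * x + rng.standard_normal(x.shape)

def log_obs_dens(y_t, x, theta):
    return -0.5 * (y_t - x) ** 2 - 0.5 * np.log(2.0 * np.pi)

def sample_x0(theta, N):
    return rng.standard_normal(N) / np.sqrt(1.0 - theta**2)  # stationary init

# Simulate a short data record from the model
T, theta = 25, 0.8
x = sample_x0(theta, 1)[0]
y = np.empty(T)
for t in range(T):
    x = theta * x + rng.standard_normal()
    y[t] = x + rng.standard_normal()

pf_args = dict(propagate=propagate, log_obs_dens=log_obs_dens, sample_x0=sample_x0)

# Average z-hat over many cheap (N = 100) particle filter runs ...
zs = [np.exp(particle_loglik(y, theta, 100, **pf_args)) for _ in range(500)]
# ... and compare with one expensive run, a proxy for the exact log-likelihood
ref = particle_loglik(y, theta, 50_000, **pf_args)
print(np.log(np.mean(zs)), ref)  # the two numbers should nearly agree
```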
Trick – defining a new extended target distribution

Enabling trick: Define a new joint target distribution over $(\theta, z)$ by simply replacing $p(y_{1:T} \mid \theta)$ with its estimator $z$.

Hence, our new target distribution is given by
\[
\pi(\theta, z \mid y_{1:T}) = \frac{z\, p(\theta)\, \psi(z \mid \theta, y_{1:T})}{p(y_{1:T})}.
\]

Key question: Is this ok?

Verifying that our new extended target is indeed ok

Requirements on $\pi$:
1. Non-negative.
2. Integrates to 1.
3. Correct marginal distribution: $\int \pi(\theta, z \mid y_{1:T})\, \mathrm{d}z = p(\theta \mid y_{1:T})$.

Requirement 1 follows from the non-negativity of $z$.

What about requirements 2 and 3?

Let us start by noting that
\[
\int \pi(\theta, z \mid y_{1:T})\, \mathrm{d}z = \frac{p(\theta)}{p(y_{1:T})} \int z\, \psi(z \mid \theta, y_{1:T})\, \mathrm{d}z.
\]

What can we say about this integral? The fact that the likelihood estimate from the particle filter is unbiased means that
\[
\int z\, \psi(z \mid \theta, y_{1:T})\, \mathrm{d}z = p(y_{1:T} \mid \theta)!!
\]

Hence, we have shown that
\[
\int \pi(\theta, z \mid y_{1:T})\, \mathrm{d}z = \frac{p(\theta)}{p(y_{1:T})}\, p(y_{1:T} \mid \theta) = p(\theta \mid y_{1:T}),
\]
which establishes requirement 3; requirement 2 then follows as well, since integrating the marginal $p(\theta \mid y_{1:T})$ over $\theta$ gives 1.
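Putting the pieces together: particle Metropolis-Hastings runs Algorithm 2 on the extended target, replacing the exact likelihood in the acceptance ratio with $\log \widehat{z}$ from the particle filter and, crucially, storing the estimate attached to the current state rather than recomputing it. A minimal sketch under the same assumptions as the earlier code (NumPy, scalar $\theta$, symmetric Gaussian random-walk proposal so the $q$-ratio cancels; `log_prior` and the particle-filter arguments are user-supplied placeholders), reusing `particle_loglik`:

```python
import numpy as np

rng = np.random.default_rng(2)

def particle_mh(y, log_prior, theta_init, M, N=100, step=0.1, **pf):
    """Particle Metropolis-Hastings (PMH): MH on the extended target
    pi(theta, z | y_{1:T}); z-hat enters the acceptance ratio in place
    of the intractable likelihood p(y_{1:T} | theta)."""
    theta = np.empty(M)
    theta[0] = theta_init
    log_z = particle_loglik(y, theta_init, N, **pf)  # z-hat of current state
    lp = log_z + log_prior(theta_init)
    for m in range(M - 1):
        theta_prop = theta[m] + step * rng.standard_normal()
        log_z_prop = particle_loglik(y, theta_prop, N, **pf)  # fresh z-hat
        lp_prop = log_z_prop + log_prior(theta_prop)
        # alpha = min{1, z' p(theta') / (z p(theta))} for a symmetric proposal
        if np.log(rng.uniform()) <= lp_prop - lp:
            theta[m + 1], lp = theta_prop, lp_prop   # accept (theta', z')
        else:
            theta[m + 1] = theta[m]                  # reject: keep old z-hat too
    return theta
```

With a prior that keeps $\theta$ inside the model's valid region, the resulting chain targets $p(\theta \mid y_{1:T})$ exactly, even though the likelihood is only ever estimated; this is the "exact approximation" announced in the outline.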