Sequential Monte Carlo methods for identification (a talk about strategies)
“The particle filter provides a systematic way of exploring the state space”

Thomas Schön, Division of Systems and Control, Department of Information Technology, Uppsala University.

Email: [email protected], www: user.it.uu.se/~thosc112

Joint work with: Fredrik Lindsten (University of Cambridge), Johan Dahlin (Linköping University), Johan Wågberg (Uppsala University), Christian A. Naesseth (Linköping University), Andreas Svensson (Uppsala University), and Liang Dai (Uppsala University).

IFAC Symposium on System Identification (SYSID), Beijing, China, October 20, 2015.

Introduction

A state space model (SSM) consists of a Markov process $\{x_t\}_{t \geq 1}$ that is indirectly observed via a measurement process $\{y_t\}_{t \geq 1}$,
$$
\begin{aligned}
x_{t+1} \mid x_t &\sim f_\theta(x_{t+1} \mid x_t, u_t), &\qquad x_{t+1} &= a_\theta(x_t, u_t) + v_{\theta,t},\\
y_t \mid x_t &\sim g_\theta(y_t \mid x_t, u_t), &\qquad y_t &= c_\theta(x_t, u_t) + e_{\theta,t},\\
x_1 &\sim \mu_\theta(x_1), &\qquad x_1 &\sim \mu_\theta(x_1),\\
(\theta &\sim \pi(\theta)), &\qquad (\theta &\sim \pi(\theta)).
\end{aligned}
$$

Identifying the nonlinear SSM: find $\theta$ based on $y_{1:T} \triangleq \{y_1, y_2, \ldots, y_T\}$ (and $u_{1:T}$). Hence, the off-line problem. One of the key challenges: the states $x_{1:T}$ are unknown.

Aim of the talk: Reveal the structure of the system identification problem arising in nonlinear SSMs and highlight where SMC is used.

Two commonly used problem formulations

Maximum likelihood (ML) formulation – model the unknown parameters as a deterministic variable and solve

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta \in \Theta}\, p_\theta(y_{1:T}).$$

Bayesian formulation – model the unknown parameters as a random variable $\theta \sim \pi(\theta)$ and compute

$$p(\theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid \theta)\,\pi(\theta)}{p(y_{1:T})} = \frac{p_\theta(y_{1:T})\,\pi(\theta)}{p(y_{1:T})}.$$

The combination of ML and Bayes is probably more interesting than we think.

Central object – the likelihood

The likelihood is computed by marginalizing the joint density

$$p_\theta(x_{1:T}, y_{1:T}) = \mu_\theta(x_1) \prod_{t=1}^{T} g_\theta(y_t \mid x_t) \prod_{t=1}^{T-1} f_\theta(x_{t+1} \mid x_t),$$
w.r.t. the state sequence $x_{1:T}$,

$$p_\theta(y_{1:T}) = \int p_\theta(x_{1:T}, y_{1:T})\, \mathrm{d}x_{1:T}.$$
We are averaging $p_\theta(x_{1:T}, y_{1:T})$ over all possible state sequences.

Equivalently we have

$$p_\theta(y_{1:T}) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{1:t-1}) = \prod_{t=1}^{T} \int g_\theta(y_t \mid x_t)\, \underbrace{p_\theta(x_t \mid y_{1:t-1})}_{\text{key challenge}}\, \mathrm{d}x_t.$$

Sequential Monte Carlo

The need for computational methods, such as SMC, is tightly coupled to the intractability of the integrals on the previous slide.

SMC offers numerical approximations to state estimation problems. The particle filter and the particle smoother maintain empirical approximations

$$\widehat{p}_\theta(x_t \mid y_{1:t}) = \sum_{i=1}^{N} w_t^i\, \delta_{x_t^i}(x_t), \qquad \widehat{p}_\theta(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^{N} w_t^i\, \delta_{x_{1:t}^i}(x_{1:t}).$$
These converge to the true distributions as $N \to \infty$.
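To make the empirical approximation concrete, here is a minimal bootstrap particle filter sketch, written for a hypothetical scalar toy model $x_{t+1} = \theta x_t + v_t$, $y_t = x_t + e_t$ (the model, the noise variances and all function names are assumptions for illustration, not taken from the slides). It also returns the log-likelihood estimate that is reused in later sketches.

```python
import numpy as np

def pf_loglik(theta, y, N=500, sv2=0.1, se2=0.5, rng=None):
    """Bootstrap particle filter for the hypothetical toy SSM
    x_{t+1} = theta*x_t + v_t, v_t ~ N(0, sv2),   y_t = x_t + e_t, e_t ~ N(0, se2),
    x_1 ~ N(0, 1).  Returns an estimate of log p_theta(y_{1:T})."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, 1.0, N)                    # draw N particles from mu_theta(x_1)
    loglik = 0.0
    for t in range(len(y)):
        # unnormalized log-weights from the measurement density g_theta(y_t | x_t)
        logw = -0.5*np.log(2*np.pi*se2) - 0.5*(y[t] - x)**2/se2
        c = logw.max()
        loglik += c + np.log(np.mean(np.exp(logw - c)))   # log of (1/N) * sum of weights
        w = np.exp(logw - c); w /= w.sum()
        # multinomial resampling followed by propagation through f_theta(x_{t+1} | x_t)
        idx = rng.choice(N, size=N, p=w)
        x = theta*x[idx] + rng.normal(0.0, np.sqrt(sv2), N)
    return loglik
```

The pairs $(x_t^i, w_t^i)$ computed inside the loop are exactly the weighted particles appearing in the sums above.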

The use of SMC in nonlinear sysid

SMC can be used to approximately

1. Compute the likelihood and its derivatives.
2. Solve state smoothing problems, e.g. compute $p(x_{1:T} \mid y_{1:T})$.
3. Simulate from the smoothing pdf, $\widetilde{x}_{1:T} \sim p(x_{1:T} \mid y_{1:T})$.

These three capabilities are key components in realizing various nonlinear system identification strategies.

Identification strategies – overview

Marginalisation: deal with $x_{1:T}$ by marginalizing (integrating) them out and view $\theta$ as the only unknown quantity.
• Frequentistic formulation: the prediction error method (PEM) and direct maximization of the likelihood.
• Bayesian formulation: the Metropolis–Hastings sampler.

Data augmentation: deal with $x_{1:T}$ by treating them as auxiliary variables to be estimated along with $\theta$.
• Frequentistic formulation: the Expectation Maximization (EM) algorithm.
• Bayesian formulation: the Gibbs sampler.

In marginalization we integrate over $x_{1:T}$. In augmentation we simulate $x_{1:T}$.

Marginalization – Metropolis Hastings

Metropolis Hastings (MH) is an MCMC method that produces a sequence of random variables $\{\theta[m]\}_{m \geq 1}$ by iterating:
1. Propose a new sample $\theta' \sim q(\cdot \mid \theta[m])$.
2. Accept the new sample with probability

$$\alpha = \min\left\{1,\; \frac{p_{\theta'}(y_{1:T})\,\pi(\theta')}{p_{\theta[m]}(y_{1:T})\,\pi(\theta[m])}\, \frac{q(\theta[m] \mid \theta')}{q(\theta' \mid \theta[m])}\right\}.$$

The above procedure results in a Markov chain $\{\theta[m]\}_{m \geq 1}$ with $p(\theta \mid y_{1:T})$ as its stationary distribution!

SMC is used to approximate the likelihood $p_\theta(y_{1:T})$.
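As a hedged sketch of how this is used, here is a pseudo-marginal (particle) Metropolis–Hastings loop that reuses the hypothetical `pf_loglik` from the particle-filter sketch above; the random-walk proposal, the $\mathcal{N}(0,1)$ prior and all tuning constants are illustrative assumptions, not prescriptions from the slides.

```python
# assumes numpy as np and pf_loglik(...) from the particle-filter sketch above
def particle_mh(y, M=5000, N=500, step=0.1, rng=None):
    """Random-walk Metropolis-Hastings on theta, with the intractable likelihood
    p_theta(y_{1:T}) replaced by its particle-filter estimate (pseudo-marginal MCMC)."""
    rng = np.random.default_rng() if rng is None else rng
    log_prior = lambda th: -0.5*th**2              # hypothetical prior theta ~ N(0, 1)
    theta = 0.5                                    # arbitrary initial value
    ll = pf_loglik(theta, y, N, rng=rng)           # store the estimate of the current point
    chain = []
    for _ in range(M):
        theta_prop = theta + step*rng.normal()     # symmetric proposal, so the q-terms cancel
        ll_prop = pf_loglik(theta_prop, y, N, rng=rng)
        log_alpha = ll_prop + log_prior(theta_prop) - ll - log_prior(theta)
        if np.log(rng.uniform()) < log_alpha:      # accept with probability min(1, alpha)
            theta, ll = theta_prop, ll_prop
        chain.append(theta)
    return np.asarray(chain)
```

Note that the likelihood estimate of the retained sample is stored and never recomputed; this is what makes the "exact approximation" argument on the next slide go through.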

Exact approximations and PMCMC

Exact approximation: $p(\theta \mid y_{1:T})$ is recovered exactly, despite the fact that we employ an SMC approximation of the likelihood using a finite number of particles $N$.

This is one of the members in the particle MCMC (PMCMC) family introduced by

Christophe Andrieu, Arnaud Doucet and Roman Holenstein, Particle Markov chain Monte Carlo methods, Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.

The idea underlying PMCMC is to make use of SMC algorithms to propose (simulate) state trajectories $x_{1:T}$. These state trajectories are then used within standard MCMC algorithms.
1. Particle Metropolis Hastings
2. Particle Gibbs

Data augmentation – Gibbs

Intuitively the data augmentation strategy amounts to iterating between updating $x_{1:T}$ and $\theta$. Gibbs sampling amounts to sequentially sampling from the conditionals of the target distribution $p(\theta, x_{1:T} \mid y_{1:T})$. A (blocked) example:
• Draw $\theta[m] \sim p(\theta \mid x_{1:T}[m-1], y_{1:T})$; OK!
• Draw $x_{1:T}[m] \sim p(x_{1:T} \mid \theta[m], y_{1:T})$. Hard!

Problem: $p(x_{1:T} \mid \theta[m], y_{1:T})$ not available!

SMC is used to simulate from the smoothing pdf $p(x_{1:T} \mid y_{1:T})$.
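Below is a hedged sketch of the blocked Gibbs strategy for the same hypothetical toy model as above. The $\theta$-draw is an exact conjugate Gaussian update for that toy model, while the trajectory draw simply traces one ancestral path of a bootstrap particle filter and therefore only approximates a draw from the smoothing pdf; particle Gibbs would instead use a conditional particle filter with a retained reference trajectory. All names and constants are assumptions.

```python
# assumes numpy as np; hypothetical toy model x_{t+1} = theta*x_t + v_t, y_t = x_t + e_t
def draw_theta(x, sv2=0.1, prior_var=1.0, rng=None):
    """Exact draw from p(theta | x_{1:T}, y_{1:T}) for the toy model: a conjugate
    Gaussian posterior from regressing x_{t+1} on x_t (prior theta ~ N(0, prior_var))."""
    rng = np.random.default_rng() if rng is None else rng
    xp, xn = x[:-1], x[1:]
    prec = 1.0/prior_var + xp @ xp / sv2
    mean = (xp @ xn / sv2) / prec
    return rng.normal(mean, np.sqrt(1.0/prec))

def draw_trajectory(theta, y, N=200, sv2=0.1, se2=0.5, rng=None):
    """Approximate draw from p(x_{1:T} | theta, y_{1:T}): run a bootstrap particle
    filter with stored ancestor indices and trace back a single path."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y)
    x = np.zeros((T, N)); anc = np.zeros((T, N), dtype=int)
    x[0] = rng.normal(0.0, 1.0, N)
    w = np.exp(-0.5*(y[0] - x[0])**2/se2); w /= w.sum()
    for t in range(1, T):
        anc[t] = rng.choice(N, size=N, p=w)                 # resample parent indices
        x[t] = theta*x[t-1, anc[t]] + rng.normal(0.0, np.sqrt(sv2), N)
        w = np.exp(-0.5*(y[t] - x[t])**2/se2); w /= w.sum()
    idx = rng.choice(N, p=w)                                # pick one end point by weight
    path = np.zeros(T)
    for t in range(T - 1, 0, -1):
        path[t] = x[t, idx]
        idx = anc[t, idx]
    path[0] = x[0, idx]
    return path

def gibbs(y, iters=1000, rng=None):
    """Blocked Gibbs: alternate the 'OK' theta-draw and the 'hard' trajectory draw."""
    rng = np.random.default_rng() if rng is None else rng
    theta, thetas = 0.5, []
    for _ in range(iters):
        x = draw_trajectory(theta, y, rng=rng)
        theta = draw_theta(x, rng=rng)
        thetas.append(theta)
    return np.asarray(thetas)
```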

Data augmentation – EM

The expectation maximization algorithm is an iterative approach to compute ML estimates of unknown parameters ($\theta$) in probabilistic models involving latent variables (the state trajectory $x_{1:T}$). EM works by iteratively computing

$$\mathcal{Q}(\theta, \theta_k) = \int \log p_\theta(x_{1:T}, y_{1:T})\, p_{\theta_k}(x_{1:T} \mid y_{1:T})\, \mathrm{d}x_{1:T}$$
and then maximizing $\mathcal{Q}(\theta, \theta_k)$ w.r.t. $\theta$.

Problem: The E-step requires us to solve a smoothing problem, i.e. to compute an expectation under $p_{\theta_k}(x_{1:T} \mid y_{1:T})$.
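As an illustration of the Monte Carlo E-step behind the MCEM/PSEM route referenced below, here is a hedged sketch that approximates $\mathcal{Q}(\theta, \theta_k)$ by averaging the complete-data log-likelihood over trajectories drawn under $\theta_k$; it reuses the hypothetical `draw_trajectory` from the Gibbs sketch above and the same toy-model assumptions.

```python
# assumes numpy as np and draw_trajectory(...) from the Gibbs sketch above (toy model)
def q_hat(theta, theta_k, y, M=50, sv2=0.1, se2=0.5, rng=None):
    """Monte Carlo E-step: Q(theta, theta_k) approximated by the average of
    log p_theta(x_{1:T}, y_{1:T}) over M trajectories simulated under theta_k."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(M):
        x = draw_trajectory(theta_k, y, sv2=sv2, se2=se2, rng=rng)
        logp = -0.5*x[0]**2                                   # log mu(x_1), constants dropped
        logp += np.sum(-0.5*(x[1:] - theta*x[:-1])**2/sv2)    # sum_t log f_theta(x_{t+1} | x_t)
        logp += np.sum(-0.5*(y - x)**2/se2)                   # sum_t log g(y_t | x_t)
        total += logp
    return total / M
```

The M-step then maximizes `q_hat` over $\theta$; for this toy model the maximizer is even available in closed form as a least-squares fit of $x_{t+1}$ on $x_t$ over the sampled trajectories.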

EM → MCEM → PSEM

SMC is used to approximate the smoothing pdf $p_{\theta_k}(x_{1:T} \mid y_{1:T})$.

Combined ML and Bayesian approach

[Diagram: EM → MCEM → PSEM; EM → SAEM → Markov SAEM → PSAEM]

In stochastic approximation EM (SAEM), $\mathcal{Q}(\theta, \theta_k)$ is replaced by

$$\widehat{\mathcal{Q}}_k(\theta) = (1 - \gamma_k)\, \widehat{\mathcal{Q}}_{k-1}(\theta) + \gamma_k \log p_\theta(x_{1:T}[k], y_{1:T}),$$
where $x_{1:T}[k]$ denotes a sample from $p_{\theta_k}(x_{1:T} \mid y_{1:T})$.
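A hedged sketch of this stochastic approximation recursion for the same hypothetical toy model, with $\widehat{\mathcal{Q}}_k$ stored on a grid of $\theta$ values. The trajectory draw reuses the bootstrap-filter `draw_trajectory` above, whereas PSAEM proper would draw $x_{1:T}[k]$ with a conditional particle filter; the grid, step size and initialization are assumptions.

```python
# assumes numpy as np and draw_trajectory(...) from the Gibbs sketch above (toy model)
def saem_sketch(y, iters=200, sv2=0.1, se2=0.5, rng=None):
    """Stochastic approximation EM: Q_k(theta) = (1-gamma_k)*Q_{k-1}(theta)
    + gamma_k*log p_theta(x_{1:T}[k], y_{1:T}), stored on a grid of theta values."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(-1.5, 1.5, 301)               # hypothetical parameter grid
    Q = np.zeros_like(grid)
    theta = 0.5                                      # arbitrary initialization
    for k in range(1, iters + 1):
        x = draw_trajectory(theta, y, sv2=sv2, se2=se2, rng=rng)          # x_{1:T}[k]
        logp = (-0.5*x[0]**2                                              # log mu(x_1)
                + np.array([-0.5*np.sum((x[1:] - th*x[:-1])**2)/sv2 for th in grid])
                - 0.5*np.sum((y - x)**2)/se2)                             # complete-data log-lik.
        gamma = 1.0/k                                # step sizes: sum = inf, sum of squares < inf
        Q = (1 - gamma)*Q + gamma*logp               # the stochastic approximation update
        theta = grid[np.argmax(Q)]                   # M-step on the running approximation
    return theta
```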

F. Lindsten, An efficient stochastic approximation EM algorithm using conditional particle filters. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013.

Example – regularization in nonlinear models

Regularization allows us to tune the model complexity.

Place a Gaussian process state space model (GP-SSM) prior on the nonlinear transition $f(\cdot)$!

Toy example:
$$x_{t+1} = -\frac{10\, x_t}{1 + 3x_t^2} + w_t, \qquad y_t = x_t + e_t,$$
with $w_t \sim \mathcal{N}(0, 0.1)$ and $e_t \sim \mathcal{N}(0, 0.5)$.

[Figure: the true and identified transition function $f(x_t)$, together with the data, for three settings of the basis function expansion: $m = 6$ basis functions, $m = 100$ basis functions, and $m = 100$ basis functions with regularization. The $m = 6$ model is not flexible enough, the $m = 100$ model over-fits the data, and the regularization remedies the over-fitting while maintaining the flexibility.]

Key: PSAEM

Andreas Svensson, Thomas B. Schön, Arno Solin and Simo Särkkä. Nonlinear state space model identification using a regularized basis function expansion. In Proceedings of the IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Cancun, Mexico, December 2015.
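To give a flavour of the regularization idea only (the method in the reference above jointly infers the unknown states with PSAEM and a GP-inspired prior), here is a hypothetical sketch of fitting a basis function expansion of $f(\cdot)$ from known state pairs, with a Gaussian prior on the weights whose standard deviation decreases with the order of the basis function; the basis, interval and variances are assumptions.

```python
import numpy as np

def fit_transition(x_in, x_out, m=100, sv2=0.1, L=4.0):
    """Fit f(x) = sum_k w_k*phi_k(x) from pairs (x_t, x_{t+1}) with a Gaussian prior
    w_k ~ N(0, 1/k^2): higher-order (faster-oscillating) basis functions are penalized
    more, which regularizes the otherwise very flexible m-term expansion."""
    k = np.arange(1, m + 1)
    phi = lambda x: np.sin(np.pi * np.outer(np.asarray(x) + L, k) / (2*L))   # basis on [-L, L]
    Phi = phi(x_in)
    # posterior mean of the weights (Bayesian linear regression / generalized ridge)
    A = Phi.T @ Phi / sv2 + np.diag(k.astype(float)**2)
    w = np.linalg.solve(A, Phi.T @ np.asarray(x_out) / sv2)
    return lambda xq: phi(xq) @ w
```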
Conclusion

          | Marginalization      | Data augmentation
ML        | Direct optimization  | Expectation Maximization
Bayesian  | Metropolis Hastings  | Gibbs sampling

SMC is used to realize all of these approaches for nonlinear SSMs. SMC can be used to approximately
1. Compute the likelihood and its derivatives.
2. Solve state smoothing problems, e.g. compute $p(x_{1:T} \mid y_{1:T})$.
3. Simulate from the smoothing pdf, $\widetilde{x}_{1:T} \sim p(x_{1:T} \mid y_{1:T})$.

1 day tutorial at ICASSP (Shanghai) in March next year.

Open tenure track position in ML

1. If you are interested, send an e-mail to Thomas Schön ([email protected]). If you know someone else who might be interested, feel free to help us spread this information as much as you can and want. Deadline for applications: November 30, 2015. Details here: www.uu.se/en/about-uu/join-us/details/?positionId=75817

2. I am also looking for new PhD students. Feel free to spread this information as well!! Topic: Nonlinear inference in system identification and machine learning. Deadline for applications: December 20, 2015. www.uu.se/en/about-uu/join-us/details/?positionId=78032

SMC convergence in one slide...

Let $\varphi : \mathsf{X} \to \mathbb{R}$ be some test function of interest. The expectation
$$\mathbb{E}_\theta[\varphi(x_t) \mid y_{1:t}] = \int \varphi(x_t)\, p_\theta(x_t \mid y_{1:t})\, \mathrm{d}x_t$$
can be estimated by the particle filter,
$$\widehat{\varphi}_t^N \triangleq \sum_{i=1}^{N} w_t^i\, \varphi(x_t^i).$$
The CLT governing the convergence of this estimator states
$$\sqrt{N}\left(\widehat{\varphi}_t^N - \mathbb{E}_\theta[\varphi(x_t) \mid y_{1:t}]\right) \xrightarrow{d} \mathcal{N}(0, \sigma_t^2(\varphi)).$$
The likelihood estimate $\widehat{p}_\theta(y_{1:t}) = \prod_{s=1}^{t} \frac{1}{N} \sum_{i=1}^{N} \bar{w}_s^i$ from the PF is unbiased, $\mathbb{E}_{\psi_\theta}[\widehat{p}_\theta(y_{1:t})] = p_\theta(y_{1:t})$ for any value of $N$, and there are CLTs available as well.
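A crude numerical illustration of the unbiasedness claim, reusing the hypothetical `pf_loglik` sketch from above on data simulated from the same toy model (all values are assumptions): averaging $\exp$ of the log-likelihood estimates over independent filter runs should give roughly the same number for small and large $N$, up to Monte Carlo noise.

```python
# assumes numpy as np and pf_loglik(...) from the particle-filter sketch above
rng = np.random.default_rng(0)
theta_true, sv2, se2, T = 0.8, 0.1, 0.5, 25
x = np.zeros(T); x[0] = rng.normal()
for t in range(1, T):
    x[t] = theta_true*x[t-1] + rng.normal(0.0, np.sqrt(sv2))
y = x + rng.normal(0.0, np.sqrt(se2), T)
for N in (10, 100, 1000):
    z = np.array([pf_loglik(theta_true, y, N=N, rng=rng) for _ in range(500)])
    print(N, np.exp(z).mean())    # estimates E[p_theta-hat(y_{1:T})]; should not depend on N
```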