
The Kalman Filter

Max Welling

California Institute of Technology

Pasadena, CA

welling@vision.caltech.edu

Introduction

Until now we have only looked at systems without any dynamics or structure over time. The models had spatial structure but were defined at one moment in time, i.e. they were static. In this lecture we will analyse a powerful dynamic model, which could be characterized as factor analysis over time, although the number of observed features is not necessarily larger than the number of factors. The idea is again to compute only the mean and the covariance, i.e. to characterize the probabilities by Gaussians. This has the advantage of being completely tractable; for strongly nonlinear systems, however, the Gaussian assumption can no longer hold. The power of the Kalman Filter (KF) is that it operates online: to compute the best estimate of the state and its uncertainty, we can update the previous estimates with the new measurement. This implies that we do not have to consider all the previous data again to compute the optimal estimates; we only need the estimates from the previous time step and the new measurement.

So what are KFs usually used for, or what do they model? It is not hard to motivate the KF, because in practice it can be used for almost everything that moves. Popular applications include guidance, tracking, sonar ranging, satellite orbit computation, stock price prediction, etc. These applications can be summarized as denoising, tracking and control. The KF is used in all sorts of fields, like seismology, bioengineering, etc. For instance, when the Eagle landed on the moon, it did so with a KF. Also, gyroscopes in airplanes use KFs. And the list goes on and on.

The main idea is that we would like to estimate a state of some sort (say, the location and velocity of an airplane) and its uncertainty. However, we do not directly observe these states. We only observe some measurements from an array of sensors, which are noisy. As an additional complication, the states evolve in time, also with their own noise or uncertainties. The question now becomes: how can we still optimally use our measurements to estimate the unobserved (hidden) states and their uncertainties?

In the following we will first describe the Kalman Filter and derive the KF equations. We assume that the parameters of the system are fixed (known). Then we derive the Kalman Smoother equations, which allow us to use measurements that lie ahead in time to better estimate the state at the current time. Because these estimates are usually less noisy than the ones we would obtain if we used measurements up to the current time only, we say that we smooth the state estimates. Next we will show how to employ the Kalman Filter and Smoother equations to efficiently estimate the parameters of the model from training data. This will then be, not surprisingly, another instance of the EM algorithm. In the appendix you will find some useful lemmas and equalities, together with the derivation of the lag-one smoother, which is needed for learning.

Introductory Example

Let me briefly describe an example. Consider a ship at sea which has lost its bearings. The crew needs to estimate the current position using the positions of the stars. In the meantime, the ship moves on on a wavy sea. The question becomes: how can we incorporate the measurements optimally to estimate the ship's location at sea? Let's assume that the model for the ship's location over time is given by

$$y_{t+1} = y_t + c + w$$

This contains a drift term (a constant velocity) $c$ and a noise term $w$. (The constant drift term will not be used in the main body of these class notes.) A noisy measurement is described by

$$x_t = y_t + v_t$$

Let's also assume we have some estimate $\hat{y}_t$ of the location at some time $t$, with some uncertainty $\sigma_t^2$. How does this change as the ship sails for one second? Of course it will drift in the direction of its velocity by a distance $c$. However, the uncertainty grows due to the waves. This is expressed as

$$\hat{y}_{t+1} = \hat{y}_t + c$$

The uncertainty is given by

$$\sigma_{t+1}^2 = \sigma_t^2 + \sigma_w^2$$

If we do not do any measurements, the uncertainty in the position will keep growing until we have no clue anymore as to where we are. If we add the information of a measurement, the final estimate is a weighted average between the observed position and the previous estimate:

$$\hat{y}'_{t+1} = \frac{\sigma_{t+1}^2}{\sigma_{t+1}^2 + \sigma_v^2}\, x_{t+1} + \frac{\sigma_v^2}{\sigma_{t+1}^2 + \sigma_v^2}\, \hat{y}_{t+1}$$

We observe that if we have infinite confidence in the measurement ($\sigma_v^2 \rightarrow 0$), then the new location estimate is simply equal to the measurement. Also, if we have infinite confidence in the previous estimate ($\sigma_{t+1}^2 \rightarrow 0$), the measurement is ignored. For the new uncertainty we find

$$\sigma'^2_{t+1} = \frac{\sigma_{t+1}^2\, \sigma_v^2}{\sigma_{t+1}^2 + \sigma_v^2}$$

This is also easy to interpret, since it says that if one of the uncertainties disappears, the total uncertainty disappears, since the other estimate can simply be ignored in that case. Notice that the uncertainty always decreases (or stays equal) by adding the measurement. The estimates for the location and uncertainty incorporating the measurement can be rewritten as follows:

$$\hat{y}'_{t+1} = \hat{y}_{t+1} + K_{t+1}\,(x_{t+1} - \hat{y}_{t+1})$$
$$\sigma'^2_{t+1} = (1 - K_{t+1})\,\sigma_{t+1}^2$$
$$K_{t+1} = \frac{\sigma_{t+1}^2}{\sigma_{t+1}^2 + \sigma_v^2}$$

From this we can see that the state estimate is corrected by the measurement error multiplied by a gain factor, the Kalman gain. If the gain is zero, no attention is paid to the measurement; if it is one, we simply use the measurement as our new state estimate. Similarly for the uncertainties: if the measurement is infinitely accurate, the gain is one, which implies that there is no uncertainty left. On the other hand, if the measurement is worthless, the gain is zero, which therefore does not decrease the overall uncertainty. The above one-dimensional example will be generalized to higher dimensions in what follows; a small numerical illustration of the one-dimensional recursion is sketched below.
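As an aside (not part of the original notes), here is a minimal Python sketch of the one-dimensional predict/update recursion above. The drift $c$, the noise levels and the number of steps are arbitrary illustrative choices.

```python
import numpy as np

def kf_1d(x_meas, c=1.0, var_w=0.5, var_v=2.0, y0=0.0, var0=1.0):
    """1-D Kalman filter for y_{t+1} = y_t + c + w,  x_t = y_t + v.
    Default noise levels are illustrative assumptions, not values from the notes."""
    y_hat, var = y0, var0
    estimates = []
    for x in x_meas:
        # Predict: drift by c, uncertainty grows by the evolution noise.
        y_hat, var = y_hat + c, var + var_w
        # Update: weighted average of prediction and measurement (Kalman gain K).
        K = var / (var + var_v)
        y_hat = y_hat + K * (x - y_hat)
        var = (1.0 - K) * var
        estimates.append((y_hat, var))
    return estimates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_y, xs = 0.0, []
    for _ in range(20):
        true_y += 1.0 + rng.normal(scale=np.sqrt(0.5))      # true dynamics
        xs.append(true_y + rng.normal(scale=np.sqrt(2.0)))  # noisy measurement
    for t, (m, v) in enumerate(kf_1d(xs), start=1):
        print(f"t={t:2d}  estimate={m:6.2f}  variance={v:4.2f}")
```

Note how the posterior variance settles to a fixed point: the gain and uncertainty do not depend on the actual measurements, only on the noise levels.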

The Model

Let us first introduce the state of the KF, $y_t$, at time $t$. The state is a vector of dimension $d$ and remains unobserved. At every time $t$ we also have a $k$-dimensional vector of observations $x_t$, which depends on the state and some additive Gaussian noise. We will assume the following dynamical model for the KF:

$$y_{t+1} = A\,y_t + w_t$$
$$x_t = B\,y_t + v_t$$

Note that the dynamics is governed by a Markov process, i.e. the state $y_{t+1}$ is independent of all other states given $y_t$. The evolution noise and the measurement noise are assumed white and Gaussian, i.e. distributed according to

$$w \sim \mathcal{G}_w[0, Q], \qquad v \sim \mathcal{G}_v[0, R]$$

The noise vectors $v_t$ and $w_t$ are also assumed to be uncorrelated with the state $y_t$. From this we simply derive

$$E[y_t v_k^T] = 0 \quad \forall\, t, k$$
$$E[y_t w_k^T] = 0 \quad \forall\, k \geq t$$
$$E[x_t v_k^T] = 0 \quad \forall\, k \neq t$$
$$E[x_t w_k^T] = 0 \quad \forall\, k \geq t$$
$$E[v_t w_k^T] = 0 \quad \forall\, t, k$$
$$E[v_t v_k^T] = \delta_{t,k}\, R \quad \forall\, t, k$$
$$E[w_t w_k^T] = \delta_{t,k}\, Q \quad \forall\, t, k$$

The above model could be considered as a factor analysis model over time, i.e. at every time instant we have a FA model where the factors now depend on the factors of the previous time step. The initial state $y_1$ is distributed according to

$$y_1 \sim \mathcal{G}[\mu, \Sigma]$$

It is easy to generalize this model to include a drift and external inputs. The drift is a constant change, expressed by adding $\epsilon$ to the dynamical equation:

$$y_{t+1} = A\,y_t + \epsilon + w_t = A'\,y'_t + w_t$$

where we incorporated the constant again through the redefinitions $A' = [A \;\; \epsilon]$ and $y'_t = [y_t^T \;\; 1]^T$. For the external inputs we let the evolution depend on an $l$-dimensional vector of inputs $u_t$ as follows:

$$y_{t+1} = A\,y_t + C\,u_t + w_t$$

This model is used when we want to control the system. We could even let the parameters $\{A, \epsilon, C\}$ depend on time. However, in the rest of this chapter we will assume the simplest case, i.e. a linear evolution without drift or inputs, and a linear measurement equation with white, uncorrelated noise. Since the initial state is Gaussian and the evolution equation is linear, this implies that the state at later times will remain Gaussian. A small sketch of sampling from this generative model is given below.
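As an illustration (not part of the original notes), the following sketch samples one sequence from the linear-Gaussian model $y_{t+1} = Ay_t + w_t$, $x_t = By_t + v_t$. The particular matrices $A$, $B$, $Q$, $R$, $\mu$, $\Sigma$ are arbitrary example values.

```python
import numpy as np

def sample_lgssm(A, B, Q, R, mu, Sigma, T, rng=None):
    """Draw one sequence (y_1..y_T, x_1..x_T) from the linear-Gaussian state-space model."""
    rng = rng or np.random.default_rng()
    d, k = A.shape[0], B.shape[0]
    ys, xs = np.zeros((T, d)), np.zeros((T, k))
    y = rng.multivariate_normal(mu, Sigma)                       # y_1 ~ G[mu, Sigma]
    for t in range(T):
        ys[t] = y
        xs[t] = B @ y + rng.multivariate_normal(np.zeros(k), R)  # x_t = B y_t + v_t
        y = A @ y + rng.multivariate_normal(np.zeros(d), Q)      # y_{t+1} = A y_t + w_t
    return ys, xs

# Arbitrary example: 2-d state (position, velocity), 1-d position measurement.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0]])
Q, R = 0.1 * np.eye(2), np.array([[1.0]])
mu, Sigma = np.zeros(2), np.eye(2)
ys, xs = sample_lgssm(A, B, Q, R, mu, Sigma, T=50)
```

The same matrices are reused in the later sketches of the filter, smoother and likelihood computations.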

General Properties

We want to be able to estimate the state and the covariance of the state at any time $t$, given a set of observations $x^\tau = \{x_1, \ldots, x_\tau\}$. If $\tau$ is equal to the current time $t$, we say that we filter the state. If $\tau$ is smaller than $t$, we say that we predict the state, and finally, if $\tau$ is larger than $t$, we say that we smooth the state. The main probability of interest is

$$p(y_t \,|\, x^\tau)$$

since it conveys all the information about the state $y_t$ at time $t$ given all the observations up to time $\tau$. Since we are assuming that this probability is Gaussian, we only need to calculate its mean and covariance, denoted by

$$\hat{y}_t^\tau = E[y_t \,|\, x^\tau]$$
$$P_t^\tau = E[\tilde{y}_t^\tau (\tilde{y}_t^\tau)^T \,|\, x^\tau]$$

where we defined $\tilde{y}_t^\tau = y_t - \hat{y}_t^\tau$, i.e. the state prediction error. Notice that these quantities still depend on the random variables $x^\tau$ and are therefore random variables themselves. We will now prove, however, that the covariance $P_t^\tau$ does actually not depend on $x^\tau$; $P_t^\tau$ may therefore be considered as a parameter in the following. To prove the above claim, we simply show that the correlation between the random variables $\tilde{y}_t^\tau$ and $x^\tau$ vanishes. For normally distributed random variables this implies that they are independent.

Lemma 1: The random variables $\tilde{y}_t^\tau = y_t - \hat{y}_t^\tau$ and $x^\tau = \{x_1, \ldots, x_\tau\}$ are independent.

Proof:

$$E[\tilde{y}_t^\tau\,(x^\tau)^T] = E[y_t\,(x^\tau)^T] - E[\hat{y}_t^\tau\,(x^\tau)^T]$$
$$= \int\!\!\int dy_t\, dx^\tau\; p(y_t, x^\tau)\; y_t\,(x^\tau)^T - \int dx^\tau\; p(x^\tau)\left[\int dy_t\; p(y_t|x^\tau)\; y_t\right](x^\tau)^T$$
$$= \int\!\!\int dy_t\, dx^\tau\; p(y_t, x^\tau)\; y_t\,(x^\tau)^T - \int\!\!\int dy_t\, dx^\tau\; p(x^\tau)\, p(y_t|x^\tau)\; y_t\,(x^\tau)^T = 0$$

since $p(y_t, x^\tau) = p(x^\tau)\, p(y_t|x^\tau)$.

From this we derive the following corollary:

$$E[\tilde{y}_{t_1}^{\tau}\, y_{t_2}^T] = E[\tilde{y}_{t_1}^{\tau}\,(\tilde{y}_{t_2}^{\tau} + \hat{y}_{t_2}^{\tau})^T] = E[\tilde{y}_{t_1}^{\tau}\,(\tilde{y}_{t_2}^{\tau})^T]$$

since $\hat{y}_{t_2}^{\tau}$ is a function of $x^{\tau}$ and is therefore independent of $\tilde{y}_{t_1}^{\tau}$; in particular, $E[\tilde{y}_t^{t-1}\, y_t^T] = P_t^{t-1}$.

Another useful result we will need is the fact that the predicted measurement error $\tilde{x}_t^{t-1} = x_t - B\hat{y}_t^{t-1}$ is independent of the measurements $x^{t-1}$. The predicted measurement error is also called the innovation, since it represents the part of the new measurement $x_t$ that cannot be predicted from the measurements $x^{t-1}$ (since they are independent).

Lemma 2: The random variables $\tilde{x}_t^{t-1} = x_t - B\hat{y}_t^{t-1}$ and $x^{t-1} = \{x_1, \ldots, x_{t-1}\}$ are independent.

Proof: The proof is simple and proceeds again by proving that they are uncorrelated:

$$E[\tilde{x}_t^{t-1}\,(x^{t-1})^T] = E[x_t\,(x^{t-1})^T] - B\,E[\hat{y}_t^{t-1}\,(x^{t-1})^T]$$
$$= B\,E[y_t\,(x^{t-1})^T] + E[v_t\,(x^{t-1})^T] - B\,E[\hat{y}_t^{t-1}\,(x^{t-1})^T]$$
$$= B\,E[\tilde{y}_t^{t-1}\,(x^{t-1})^T] + E[v_t\,(x^{t-1})^T] = 0$$

where we used the result that $\tilde{y}_t^{t-1}$ is independent of $x^{t-1}$ (proved directly above) and that $v_t$ is independent of $x^{t-1}$.

Notice that since the innovation $\tilde{x}_\tau^{\tau-1}$ is a function of $x^{\tau}$, this also implies the following corollary:

$$E[\tilde{x}_t^{t-1}\,(\tilde{x}_\tau^{\tau-1})^T] = 0 \quad \text{for } \tau < t$$

Before we proceed, we would like to remark that we have not well motivated $\hat{y}_t^\tau$ as the preferred estimate of the state $y_t$ given the data $x^\tau$. Other possible choices could be the most likely state given the data, or the most likely sequence of states $y_1, \ldots, y_t$ given the data. It turns out that for Gaussian distributed random variables these objectives are equivalent. On top of that, it turns out that $\hat{y}_t^\tau$ is also the minimum variance estimate, i.e. it minimizes $P_t^\tau$. A small Monte Carlo sanity check of Lemma 1 is sketched below.
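As an aside (not in the original notes), Lemma 1 can be checked numerically for a simple scalar Gaussian model: the sample correlation between the estimation error $y - E[y|x]$ and the observation $x$ should vanish. The particular model below, with $y \sim \mathcal{N}(0,1)$ and $x = y + v$, is an arbitrary illustrative choice.

```python
import numpy as np

# Arbitrary scalar example: y ~ N(0, 1), x = y + v with v ~ N(0, r).
rng = np.random.default_rng(1)
n, r = 200_000, 0.5
y = rng.normal(size=n)
x = y + rng.normal(scale=np.sqrt(r), size=n)

# For jointly Gaussian (y, x): E[y|x] = cov(y,x)/var(x) * x = x / (1 + r).
y_hat = x / (1.0 + r)
err = y - y_hat                      # the estimation error  \tilde{y}

print("corr(err, x) =", np.corrcoef(err, x)[0, 1])                  # ~ 0 (Lemma 1)
print("var(err)     =", err.var(), " vs  r/(1+r) =", r / (1.0 + r))  # matches P
```

The error variance matches the analytic posterior variance, and the correlation with the data is numerically zero, illustrating that $P$ does not depend on the observed data.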

Kalman Filter Equations

We will now proceed to derive the Kalman filter equations, i.e. $\tau = t$. First write

$$p(y_t \,|\, x^t) = \frac{p(x_t \,|\, y_t)\; p(y_t \,|\, x^{t-1})}{p(x_t \,|\, x^{t-1})}$$

where

$$p(y_t \,|\, x^{t-1}) = \int dy_{t-1}\; p(y_t \,|\, y_{t-1})\; p(y_{t-1} \,|\, x^{t-1})$$

The denominator in the first equation is an unimportant normalization factor. The remaining densities are given by

$$p(x_t \,|\, y_t) = \mathcal{G}_{x_t}[B y_t,\, R]$$
$$p(y_t \,|\, y_{t-1}) = \mathcal{G}_{y_t}[A y_{t-1},\, Q]$$

The two equations above have interesting interpretations: a reactive reinforcement due to an observation, and a diffusion equation for the Gaussian probability between observations. The first equation is basically applying Bayes' rule in the same way as we did for Bayesian learning, i.e. we are updating the probability distribution of an unknown random variable by including evidence. This has the effect of making the distribution peakier, i.e. less uncertain. The second equation evolves the hidden state from one time instant to the next without considering more evidence. This has the effect of making the distribution less peaky, i.e. introducing more uncertainty. Together these equations express $p(y_t|x^t)$ in terms of $p(y_{t-1}|x^{t-1})$ and may be used recursively. It is now easy to verify that in the case of prediction only the second equation remains, i.e.

$$p(y_t \,|\, x^\tau) = \int dy_{t-1}\; p(y_t \,|\, y_{t-1})\; p(y_{t-1} \,|\, x^\tau), \qquad \tau < t$$

The case $\tau > t$, i.e. smoothing, is more difficult and will be dealt with later. Let us return to the filtering case. We would like to calculate the mean $\hat{y}_t^{t-1}$ and covariance $P_t^{t-1}$ of the pdf $p(y_t|x^{t-1})$, expressed in terms of the mean $\hat{y}_{t-1}^{t-1}$ and covariance $P_{t-1}^{t-1}$ of the density $p(y_{t-1}|x^{t-1})$. Notice that these two moments determine the densities completely, since they are Gaussian. For the mean we find

$$\hat{y}_t^{t-1} = E[y_t \,|\, x^{t-1}] = A\,E[y_{t-1} \,|\, x^{t-1}] + E[w_{t-1} \,|\, x^{t-1}] = A\,\hat{y}_{t-1}^{t-1}$$

where we have used the evolution equation and the fact that $w_{t-1}$ has zero mean. For the covariance we first write

$$\tilde{y}_t^{t-1} = y_t - \hat{y}_t^{t-1} = A y_{t-1} + w_{t-1} - A\hat{y}_{t-1}^{t-1} = A\,\tilde{y}_{t-1}^{t-1} + w_{t-1}$$

and notice that $w_{t-1}$ is independent of $\tilde{y}_{t-1}^{t-1}$, since the latter is a function of $y_{t-1}$ and $x^{t-1}$, while $w_{t-1}$ is independent of both. Thus we can write

$$P_t^{t-1} = E[\tilde{y}_t^{t-1}(\tilde{y}_t^{t-1})^T]$$
$$= E\big[(A\tilde{y}_{t-1}^{t-1} + w_{t-1})(A\tilde{y}_{t-1}^{t-1} + w_{t-1})^T\big]$$
$$= A\,E[\tilde{y}_{t-1}^{t-1}(\tilde{y}_{t-1}^{t-1})^T]\,A^T + E[w_{t-1}w_{t-1}^T]$$
$$= A\,P_{t-1}^{t-1}\,A^T + Q$$

Next we wish to calculate $\hat{y}_t^t$ and $P_t^t$ in terms of the above calculated quantities, using Bayes' rule as given at the start of this section. We write

$$p(y_t \,|\, x^t) = \frac{\mathcal{G}_{x_t}[B y_t,\, R]\;\; \mathcal{G}_{y_t}[\hat{y}_t^{t-1},\, P_t^{t-1}]}{p(x_t \,|\, x^{t-1})}$$

We now use the Gaussian identities of appendix A, applying the first identity to the factor $\mathcal{G}_{x_t}[By_t, R]$ in the numerator and then the product identity to the result. We find

$$p(y_t|x^t) = k_2(x^t)\;\mathcal{G}_{y_t}\!\Big[\big[(P_t^{t-1})^{-1} + B^TR^{-1}B\big]^{-1}\big[(P_t^{t-1})^{-1}\hat{y}_t^{t-1} + B^TR^{-1}x_t\big],\;\; \big[(P_t^{t-1})^{-1} + B^TR^{-1}B\big]^{-1}\Big]$$

Because we know that the multiplication of two Gaussians is again a Gaussian, and moreover that the result must be normalized with respect to $y_t$, we deduce that the factor $k_2(x^t)$ is precisely the required normalization constant. Finally, we use the matrix inversion lemma of appendix A to show that $p(y_t|x^t)$ is a Gaussian distribution with the following mean and covariance:

$$\hat{y}_t^t = \hat{y}_t^{t-1} + K_t\,(x_t - B\hat{y}_t^{t-1})$$
$$P_t^t = (I - K_t B)\,P_t^{t-1}$$
$$K_t = P_t^{t-1}B^T\,\big(R + BP_t^{t-1}B^T\big)^{-1}$$

where $K_t$ is called the Kalman gain. These equations, initialized by $\hat{y}_1^0 = \mu$ and $P_1^0 = \Sigma$ and combined with the prediction equations above, constitute the celebrated Kalman Filter equations, and allow one to estimate the state of the system online, i.e. every new observation can be used recursively, given the information that was already received before. Notice that the gain factor $K_t$ grows if the measurement covariance $R$ becomes smaller, thus putting more weight on the measurement residual (the difference between the predicted and actual measurement). Also, if the noise covariance $P_t^{t-1}$ becomes smaller, less emphasis is put on the measurement residual. It is also instructive to notice that the evolution of the state covariance $P_t^t$, and therefore the Kalman gain $K_t$, is independent of the measurements and states, and may be precomputed.

From a numerical point of view, the covariance update above is not preferable, since it is a difference of two positive definite matrices, which is not guaranteed to result in a positive definite matrix and may lead to numerical instabilities. This, however, is easily fixed by noting that from the expression for the Kalman gain we can derive

$$(I - K_t B)\,P_t^{t-1}B^T K_t^T = K_t R K_t^T$$

which can be used to rewrite the covariance update as

$$P_t^t = (I - K_t B)\,P_t^{t-1}\,(I - K_t B)^T + K_t R K_t^T$$

which is a sum of two positive definite matrices. The complete filter recursion is illustrated in the sketch below.
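The following Python sketch (not part of the original notes) runs the filter recursion as derived above, one predict/update step per observation, using the numerically stable form of the covariance update just given; the example matrices are the same arbitrary ones used in the earlier sampling sketch.

```python
import numpy as np

def kalman_filter(xs, A, B, Q, R, mu, Sigma):
    """Return filtered means/covariances and the predicted ones (needed for smoothing)."""
    d = A.shape[0]
    y_pred, P_pred = mu.copy(), Sigma.copy()          # \hat{y}_1^0, P_1^0
    y_filt, P_filt, preds = [], [], []
    for x in xs:
        preds.append((y_pred.copy(), P_pred.copy()))  # store \hat{y}_t^{t-1}, P_t^{t-1}
        # Update with the new measurement x_t.
        S = B @ P_pred @ B.T + R                      # innovation covariance
        K = P_pred @ B.T @ np.linalg.inv(S)           # Kalman gain
        y = y_pred + K @ (x - B @ y_pred)
        I_KB = np.eye(d) - K @ B
        P = I_KB @ P_pred @ I_KB.T + K @ R @ K.T      # sum of two PSD terms
        y_filt.append(y); P_filt.append(P)
        # Predict the next state.
        y_pred, P_pred = A @ y, A @ P @ A.T + Q
    return y_filt, P_filt, preds

# Usage with the example model from the previous sketch:
# y_filt, P_filt, preds = kalman_filter(xs, A, B, Q, R, mu, Sigma)
```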

Kalman Smoother Equations

Next we want to solve the smoothing problem. This implies that we are going to include later measurements $x_s$ with $s > t$ to improve our estimates of the states $y_t$. The resulting estimates will be smoother (less noisy). First concentrate on the mean $E[y_{t-1}|x^\tau]$ for $\tau \geq t$. We will now invoke the corollary of the Gaussian lemma in appendix A, identifying $y \leftarrow y_{t-1}$, $x \leftarrow y_t$ and $\mathcal{G}_z \leftarrow p(y_{t-1}, y_t | x^\tau)$. Using these identifications we find

$$\hat{y}_{t-1}^\tau = E[y_{t-1} \,|\, x^\tau] = E\big[\,E[y_{t-1} \,|\, y_t, x^\tau]\;\big|\; x^\tau\,\big]$$

Next we write

$$p(y_{t-1}, y_t \,|\, x^\tau) = \frac{p(y_{t-1}, y_t, x_t, \ldots, x_\tau, x^{t-1})}{p(x^\tau)}$$
$$= \frac{p(x_t, \ldots, x_\tau \,|\, y_{t-1}, y_t, x^{t-1})\; p(y_t \,|\, y_{t-1}, x^{t-1})\; p(y_{t-1} \,|\, x^{t-1})\; p(x^{t-1})}{p(x^\tau)}$$
$$= \frac{p(x_t, \ldots, x_\tau \,|\, y_t)\; p(y_t \,|\, y_{t-1})\; p(y_{t-1} \,|\, x^{t-1})}{p(x_t, \ldots, x_\tau \,|\, x^{t-1})}$$
$$= k_1(y_t, x^\tau)\; p(y_t \,|\, y_{t-1})\; p(y_{t-1} \,|\, x^{t-1})$$
$$= k_1(y_t, x^\tau)\; \mathcal{G}_{y_t}[A y_{t-1},\, Q]\; \mathcal{G}_{y_{t-1}}[\hat{y}_{t-1}^{t-1},\, P_{t-1}^{t-1}]$$

In the same spirit as the derivation for the Kalman filter above, we now use the Gaussian identities of appendix A, applying the first identity to the factor $\mathcal{G}_{y_t}[Ay_{t-1}, Q]$ and then the product identity to the result:

$$p(y_{t-1} \,|\, y_t, x^\tau) = \frac{p(y_{t-1}, y_t \,|\, x^\tau)}{p(y_t \,|\, x^\tau)}$$
$$= k_2(y_t, x^\tau)\;\mathcal{G}_{y_{t-1}}\!\Big[\big[(P_{t-1}^{t-1})^{-1} + A^TQ^{-1}A\big]^{-1}\big[(P_{t-1}^{t-1})^{-1}\hat{y}_{t-1}^{t-1} + A^TQ^{-1}y_t\big],\;\; \big[(P_{t-1}^{t-1})^{-1} + A^TQ^{-1}A\big]^{-1}\Big]$$

Notice the similarity between this expression and the one derived for the filter. Because we know that the multiplication of two Gaussians is again a Gaussian, and moreover that the result must be normalized with respect to $y_{t-1}$, we deduce that the factor $k_2(y_t, x^\tau)$ is precisely the required normalization constant. Finally, we invoke the iterated-expectation relation above and the matrix inversion lemma of appendix A to find

$$\hat{y}_{t-1}^\tau = \hat{y}_{t-1}^{t-1} + J_{t-1}\,(\hat{y}_t^\tau - \hat{y}_t^{t-1})$$
$$J_{t-1} = P_{t-1}^{t-1}A^T\,(P_t^{t-1})^{-1}$$

where we used $\hat{y}_t^{t-1} = A\hat{y}_{t-1}^{t-1}$ in the last line. This recursion is initialized with $\hat{y}_\tau^\tau$, computed from the Kalman Filter equations.

For the covariance we first observe

$$\tilde{y}_{t-1}^\tau + J_{t-1}\hat{y}_t^\tau = \tilde{y}_{t-1}^{t-1} + J_{t-1}A\hat{y}_{t-1}^{t-1}$$

where we used the smoother mean equation and $\hat{y}_t^{t-1} = A\hat{y}_{t-1}^{t-1}$. Multiplying both sides with their respective transpose from the right, taking expectations, and using the fact that $\tilde{y}_{t-1}^{t-1}$ is independent of $\hat{y}_{t-1}^{t-1}$ (since the latter is a function of $x^{t-1}$; Lemma 1), and similarly for $\tilde{y}_{t-1}^\tau$ and $\hat{y}_t^\tau$, gives us

$$P_{t-1}^\tau + J_{t-1}\,E[\hat{y}_t^\tau(\hat{y}_t^\tau)^T]\,J_{t-1}^T = P_{t-1}^{t-1} + J_{t-1}A\,E[\hat{y}_{t-1}^{t-1}(\hat{y}_{t-1}^{t-1})^T]\,A^TJ_{t-1}^T$$

Then we use

$$E[\hat{y}_t^\tau(\hat{y}_t^\tau)^T] = E[y_t y_t^T] - P_t^\tau$$
$$= E\big[(Ay_{t-1} + w_{t-1})(Ay_{t-1} + w_{t-1})^T\big] - P_t^\tau$$
$$= A\,E[y_{t-1}y_{t-1}^T]\,A^T + Q - P_t^\tau$$

where the evolution equation, the whiteness of the noise and the corollary following Lemma 1 were used. Analogously,

$$E[\hat{y}_{t-1}^{t-1}(\hat{y}_{t-1}^{t-1})^T] = E[y_{t-1}y_{t-1}^T] - P_{t-1}^{t-1}$$

Putting these together and using $P_t^{t-1} = AP_{t-1}^{t-1}A^T + Q$, we find

$$P_{t-1}^\tau = P_{t-1}^{t-1} + J_{t-1}\,\big(P_t^\tau - P_t^{t-1}\big)\,J_{t-1}^T$$

which is initialized by $P_\tau^\tau$, computed from the Kalman Filter equations. The two recursions above are the so-called Kalman smoother equations. If we want to estimate a state $y_t$ given data $x^\tau$ with $\tau > t$, then we first apply the Kalman filter equations recursively until we have reached the state $y_\tau$. While moving forward we store the values of $\hat{y}_t^{t-1}$, $\hat{y}_t^t$, $P_t^{t-1}$ and $P_t^t$ for all $t \leq \tau$. Then we move backward by applying the smoother equations until we have reached the state $t$ we would like to estimate. Because we include more observations in the estimation of the state, the result will be less noisy compared to the Kalman filter result (hence the name smoother). A sketch of the backward pass is given below.
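As a companion to the filter sketch above (again, not part of the original notes), this backward pass implements the smoother recursions for the means and covariances, consuming the quantities stored during the forward (filter) pass.

```python
import numpy as np

def kalman_smoother(y_filt, P_filt, preds, A):
    """Backward pass: returns smoothed means/covariances and the gains J_t."""
    T = len(y_filt)
    y_sm, P_sm, Js = list(y_filt), list(P_filt), [None] * T
    for t in range(T - 2, -1, -1):
        y_pred_next, P_pred_next = preds[t + 1]            # \hat{y}_{t+1}^{t}, P_{t+1}^{t}
        J = P_filt[t] @ A.T @ np.linalg.inv(P_pred_next)   # smoother gain J_t
        Js[t] = J
        y_sm[t] = y_filt[t] + J @ (y_sm[t + 1] - y_pred_next)
        P_sm[t] = P_filt[t] + J @ (P_sm[t + 1] - P_pred_next) @ J.T
    return y_sm, P_sm, Js

# Usage, continuing the earlier sketches:
# y_sm, P_sm, Js = kalman_smoother(y_filt, P_filt, preds, A)
```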

Parameter Estimation for the Kalman Model

We will now proceed to estimate the parameters $\{\mu, \Sigma, B, R, A, Q\}$ of the Kalman filter model using EM. We consider the states $y_t$ as hidden variables, while the $x_t$ are the observations. We assume we have observed $N$ sequences of length $\tau$. The joint probability of the complete data is given by

$$p(y, x) = p(y_1)\,\prod_{t=2}^{\tau} p(y_t \,|\, y_{t-1})\,\prod_{t=1}^{\tau} p(x_t \,|\, y_t)$$

For EM we are interested in the expectation of the log of the complete-data joint pdf over the posterior density:

$$\mathcal{Q} = \sum_{n=1}^{N}\int dy\; p(y \,|\, x_n)\,\log p(y, x_n)$$
$$= -\sum_{n=1}^{N}\int dy\; p(y \,|\, x_n)\,\bigg\{\tfrac{1}{2}\tau(d + k)\log 2\pi + \tfrac{1}{2}\log\det\Sigma + \tfrac{\tau-1}{2}\log\det Q + \tfrac{\tau}{2}\log\det R$$
$$\qquad +\; \tfrac{1}{2}(y_1 - \mu)^T\Sigma^{-1}(y_1 - \mu) + \tfrac{1}{2}\sum_{t=2}^{\tau}(y_t - Ay_{t-1})^TQ^{-1}(y_t - Ay_{t-1}) + \tfrac{1}{2}\sum_{t=1}^{\tau}(x_{tn} - By_t)^TR^{-1}(x_{tn} - By_t)\bigg\}$$

Inspection of this objective function reveals that the only sufficient statistics that need to be calculated in the E-step are

$$E[y_t \,|\, x_n] = \hat{y}_{tn}^\tau$$
$$E[y_t y_t^T \,|\, x_n] = P_t^\tau + \hat{y}_{tn}^\tau(\hat{y}_{tn}^\tau)^T \equiv M_t^n$$
$$E[y_t y_{t-1}^T \,|\, x_n] = P_{t,t-1}^\tau + \hat{y}_{tn}^\tau(\hat{y}_{t-1,n}^\tau)^T \equiv M_{t,t-1}^n$$

Fortunately, except for the last one, these are precisely the quantities that can be calculated through the Kalman Filter and Smoother recursions. The last quantity, which is called the lag-one covariance smoother, is computed in appendix C. It is given by the following backward recursion

$$P_{t-1,t-2}^\tau = P_{t-1}^{t-1}J_{t-2}^T + J_{t-1}\,\big(P_{t,t-1}^\tau - AP_{t-1}^{t-1}\big)\,J_{t-2}^T$$

which is initialized by

$$P_{\tau,\tau-1}^\tau = (I - K_\tau B)\,A\,P_{\tau-1}^{\tau-1}$$

We now concentrate on the M-step, which maximizes $\mathcal{Q}$ with respect to the parameters of the joint density only, i.e. the parameters present in the posterior are held fixed. Taking derivatives with respect to $\mu$ and equating to zero gives

$$\frac{\partial \mathcal{Q}}{\partial \mu} = \Sigma^{-1}\sum_{n=1}^{N}\big(\hat{y}_{1n}^\tau - \mu\big) = 0$$
$$\Rightarrow\quad \mu_{new} = \frac{1}{N}\sum_{n=1}^{N}\hat{y}_{1n}^\tau$$

For $\Sigma$ this implies

$$\frac{\partial \mathcal{Q}}{\partial \Sigma^{-1}} = \frac{N}{2}\Sigma - \frac{1}{2}\sum_{n=1}^{N}\Big(M_1^n - \hat{y}_{1n}^\tau\mu^T - \mu(\hat{y}_{1n}^\tau)^T + \mu\mu^T\Big) = 0$$
$$\Rightarrow\quad \Sigma_{new} = \frac{1}{N}\sum_{n=1}^{N}M_1^n - \mu_{new}\mu_{new}^T = P_1^\tau + \frac{1}{N}\sum_{n=1}^{N}\big(\hat{y}_{1n}^\tau - \mu_{new}\big)\big(\hat{y}_{1n}^\tau - \mu_{new}\big)^T$$

where we used that $P_1^\tau$ is independent of $x_n$.

Taking derivatives with respect to $A$ gives

$$\frac{\partial \mathcal{Q}}{\partial A} = \sum_{n=1}^{N}\sum_{t=2}^{\tau}\Big(Q^{-1}M_{t,t-1}^n - Q^{-1}A\,M_{t-1}^n\Big) = 0$$
$$\Rightarrow\quad A_{new} = \bigg(\sum_{n=1}^{N}\sum_{t=2}^{\tau}M_{t,t-1}^n\bigg)\bigg(\sum_{n=1}^{N}\sum_{t=2}^{\tau}M_{t-1}^n\bigg)^{-1}$$

For $Q$ we find

$$\frac{\partial \mathcal{Q}}{\partial Q^{-1}} = \frac{N(\tau-1)}{2}Q - \frac{1}{2}\sum_{n=1}^{N}\sum_{t=2}^{\tau}\Big(M_t^n - A M_{t-1,t}^n - M_{t,t-1}^n A^T + A M_{t-1}^n A^T\Big) = 0$$
$$\Rightarrow\quad Q_{new} = \frac{1}{N(\tau-1)}\sum_{n=1}^{N}\sum_{t=2}^{\tau}\Big(M_t^n - A_{new}M_{t-1,t}^n\Big)$$

where we used the update for $A_{new}$ and $M_{t-1,t}^n = (M_{t,t-1}^n)^T$. For $B$ we have

$$\frac{\partial \mathcal{Q}}{\partial B} = \sum_{n=1}^{N}\sum_{t=1}^{\tau}\Big(R^{-1}x_{tn}(\hat{y}_{tn}^\tau)^T - R^{-1}B\,M_t^n\Big) = 0$$
$$\Rightarrow\quad B_{new} = \bigg(\sum_{n=1}^{N}\sum_{t=1}^{\tau}x_{tn}(\hat{y}_{tn}^\tau)^T\bigg)\bigg(\sum_{n=1}^{N}\sum_{t=1}^{\tau}M_t^n\bigg)^{-1}$$

And finally we have for $R$

$$\frac{\partial \mathcal{Q}}{\partial R^{-1}} = \frac{N\tau}{2}R - \frac{1}{2}\sum_{n=1}^{N}\sum_{t=1}^{\tau}\Big(x_{tn}x_{tn}^T - B\hat{y}_{tn}^\tau x_{tn}^T - x_{tn}(\hat{y}_{tn}^\tau)^TB^T + BM_t^nB^T\Big) = 0$$
$$\Rightarrow\quad R_{new} = \frac{1}{N\tau}\sum_{n=1}^{N}\sum_{t=1}^{\tau}\Big(x_{tn}x_{tn}^T - B_{new}\hat{y}_{tn}^\tau x_{tn}^T\Big)$$

where we used the update for $B_{new}$. Alternating E-steps and M-steps will thus converge to the maximum likelihood estimates of these parameters. Notice that the recursion equations fulfil a double role: they may be used to efficiently compute the E-step in the learning problem, and, once the parameters are fixed, to estimate the optimal state and state covariance of the dynamical system (possibly online). The M-step updates are summarized in the sketch below.
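For concreteness, here is a sketch (not part of the original notes) of the M-step updates for a single sequence ($N = 1$), written in terms of the E-step sufficient statistics $\hat{y}_t^\tau$, $M_t$ and $M_{t,t-1}$ computed with the filter, smoother and lag-one recursions above. The argument names and layout are assumptions of this sketch.

```python
import numpy as np

def m_step(xs, y_sm, P_sm, P_lag):
    """One M-step for N=1. xs[t], y_sm[t]: data and smoothed means;
    P_sm[t]: smoothed covariances; P_lag[t] = P_{t,t-1} for t >= 1 (index 0 unused)."""
    T = len(xs)
    M   = [P_sm[t] + np.outer(y_sm[t], y_sm[t]) for t in range(T)]          # E[y_t y_t^T]
    M10 = [P_lag[t] + np.outer(y_sm[t], y_sm[t - 1]) for t in range(1, T)]  # E[y_t y_{t-1}^T]

    mu_new    = y_sm[0]                                   # for N=1 the mean is just y_1 hat
    Sigma_new = P_sm[0]
    A_new = sum(M10) @ np.linalg.inv(sum(M[:-1]))
    Q_new = (sum(M[1:]) - A_new @ sum(m.T for m in M10)) / (T - 1)
    B_new = sum(np.outer(xs[t], y_sm[t]) for t in range(T)) @ np.linalg.inv(sum(M))
    R_new = sum(np.outer(xs[t], xs[t]) - B_new @ np.outer(y_sm[t], xs[t])
                for t in range(T)) / T
    return mu_new, Sigma_new, A_new, Q_new, B_new, R_new
```

Each update is a direct transcription of the corresponding closed-form expression above, with the sums over $n$ dropped because only one sequence is used.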

Computation of Likelihood

To monitor the total log-likelihood of the system we may calculate

$$L = \sum_{n=1}^{N}\log p(x_n) = \sum_{n=1}^{N}\log p(x_{1n}) + \sum_{n=1}^{N}\sum_{t=2}^{\tau}\log p(x_{tn} \,|\, x_n^{t-1})$$

The mean and covariance of the Gaussians $p(x_{tn}|x_n^{t-1})$ can be computed as follows (omitting $n$ for notational convenience):

$$\hat{x}_t^{t-1} = \int dx_t\; p(x_t \,|\, x^{t-1})\; x_t$$
$$= \int\!\!\int dx_t\, dy_t\; p(x_t \,|\, y_t)\; p(y_t \,|\, x^{t-1})\; x_t$$
$$= \int\!\!\int dx_t\, dy_t\; \mathcal{G}_{y_t}[\hat{y}_t^{t-1}, P_t^{t-1}]\; \mathcal{G}_{x_t}[By_t, R]\; x_t$$
$$= \int dy_t\; \mathcal{G}_{y_t}[\hat{y}_t^{t-1}, P_t^{t-1}]\; By_t$$
$$= B\hat{y}_t^{t-1}$$

For $p(x_1)$ the above calculation gives

$$\hat{x}_1^0 = B\mu$$

Similarly, for the covariance we find

$$H_t^{t-1} = \int dx_t\; p(x_t \,|\, x^{t-1})\,(x_t - \hat{x}_t^{t-1})(x_t - \hat{x}_t^{t-1})^T$$
$$= \int\!\!\int dx_t\, dy_t\; \mathcal{G}_{y_t}[\hat{y}_t^{t-1}, P_t^{t-1}]\; \mathcal{G}_{x_t}[By_t, R]\,(x_t - \hat{x}_t^{t-1})(x_t - \hat{x}_t^{t-1})^T$$
$$= \int dy_t\; \mathcal{G}_{y_t}[\hat{y}_t^{t-1}, P_t^{t-1}]\,\Big(R + (By_t - B\hat{y}_t^{t-1})(By_t - B\hat{y}_t^{t-1})^T\Big)$$
$$= R + BP_t^{t-1}B^T$$

The covariance for $p(x_1)$ is then given by

$$H_1^0 = R + B\Sigma B^T$$

A sketch of this computation appears below.
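The following Python sketch (not part of the original notes) accumulates $\log\mathcal{N}(x_t;\, B\hat{y}_t^{t-1},\, R + BP_t^{t-1}B^T)$ over one sequence, reusing the predicted quantities stored by the filter sketch above.

```python
import numpy as np

def log_likelihood(xs, preds, B, R):
    """Sum of log p(x_t | x^{t-1}) for one sequence, using the predicted means/covariances."""
    k = B.shape[0]
    total = 0.0
    for x, (y_pred, P_pred) in zip(xs, preds):
        mean = B @ y_pred                       # \hat{x}_t^{t-1} = B \hat{y}_t^{t-1}
        H = R + B @ P_pred @ B.T                # H_t^{t-1} = R + B P_t^{t-1} B^T
        diff = x - mean
        _, logdet = np.linalg.slogdet(H)
        total += -0.5 * (k * np.log(2 * np.pi) + logdet
                         + diff @ np.linalg.solve(H, diff))
    return total

# Usage: L = log_likelihood(xs, preds, B, R)   # preds as returned by kalman_filter(...)
```

Monitoring this quantity after every EM iteration is a convenient convergence check, since it should never decrease.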

A Lemmas

Lemma (Gaussian identities): Let $\mathcal{G}_y[\mu, \Sigma]$ denote a normal density in $y$ with mean $\mu$ and covariance $\Sigma$. We have the following identities:

$$\mathcal{G}_x[Ay, \Sigma] = k_1(x)\;\mathcal{G}_y\big[(A^T\Sigma^{-1}A)^{-1}A^T\Sigma^{-1}x,\; (A^T\Sigma^{-1}A)^{-1}\big]$$
$$\mathcal{G}_y[a, A]\;\mathcal{G}_y[b, B] = \mathcal{G}_y\big[(A^{-1}+B^{-1})^{-1}(A^{-1}a + B^{-1}b),\; (A^{-1}+B^{-1})^{-1}\big]\;\mathcal{G}_a[b,\; A+B]$$

Also, if we write $z = (y, x)$, $\mu_z = (\mu_y, \mu_x)$ and

$$\Sigma_z = \begin{pmatrix}\Sigma_{yy} & \Sigma_{yx}\\ \Sigma_{xy} & \Sigma_{xx}\end{pmatrix}$$

then we have

$$\int dx\;\mathcal{G}_z[\mu_z, \Sigma_z] = \mathcal{G}_y[\mu_y, \Sigma_{yy}]$$
$$\mathcal{G}_z[\mu_z, \Sigma_z] = \mathcal{G}_{y|x}\big[\mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(x - \mu_x),\; \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\big]\;\mathcal{G}_x[\mu_x, \Sigma_{xx}]$$

As a corollary we notice that

$$E_x\big[E[y \,|\, x]\big] = \mu_y = E[y]$$

Lemma (matrix inversion): Consider a $d \times d$ matrix $P > 0$, a $k \times k$ matrix $R > 0$ and a $k \times d$ matrix $B$, where $P > 0$ means $a^TPa > 0\ \forall a \neq 0$, i.e. positive eigenvalues. The following equalities hold:

$$\big(P^{-1} + B^TR^{-1}B\big)^{-1} = P - PB^T\big(BPB^T + R\big)^{-1}BP$$
$$\big(P^{-1} + B^TR^{-1}B\big)^{-1}B^TR^{-1} = PB^T\big(BPB^T + R\big)^{-1}$$
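These two identities are easy to check numerically; the following snippet (not part of the original notes) verifies them for random positive definite $P$, $R$ and a random $B$ of arbitrary example sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                                                         # arbitrary sizes
P = (lambda M: M @ M.T + d * np.eye(d))(rng.normal(size=(d, d)))    # random SPD matrix
R = (lambda M: M @ M.T + k * np.eye(k))(rng.normal(size=(k, k)))
B = rng.normal(size=(k, d))

lhs = np.linalg.inv(np.linalg.inv(P) + B.T @ np.linalg.inv(R) @ B)
rhs1 = P - P @ B.T @ np.linalg.inv(B @ P @ B.T + R) @ B @ P
rhs2 = P @ B.T @ np.linalg.inv(B @ P @ B.T + R)

print(np.allclose(lhs, rhs1))                                # first identity
print(np.allclose(lhs @ B.T @ np.linalg.inv(R), rhs2))       # second identity
```

Both checks print True; the second identity is the one that turns the information-form posterior into the Kalman-gain form used in the filter derivation.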

B Matrix Identities

The following identities are useful in the derivations below:

$$a^TAb = \mathrm{tr}\big[Aba^T\big]$$
$$\mathrm{tr}[AB] = \mathrm{tr}[BA]$$
$$\frac{\partial}{\partial A}\,\mathrm{tr}[AB] = B^T$$
$$\frac{\partial}{\partial A}\,\mathrm{tr}[A^TB] = B$$
$$\log\det A^{-1} = -\log\det A$$
$$\frac{\partial}{\partial A}\,\log\det A = (A^{-1})^T$$

C Lag-One Covariance Smoother

In the E-step of the EM algorithm that estimates the parameters of the KF, we needed the filter and smoother equations. However, one quantity remains undetermined, which is the so-called lag-one covariance smoother $P_{t,t-1}^\tau$. This quantity is again computed with a backward recursion. First we will derive the initial value, then the backward recursion itself.

$$P_{t,t-1}^t = E\big[\tilde{y}_t^t(\tilde{y}_{t-1}^t)^T\big]$$
$$= E\Big[\big\{\tilde{y}_t^{t-1} - K_t(x_t - B\hat{y}_t^{t-1})\big\}\big\{\tilde{y}_{t-1}^{t-1} - J_{t-1}K_t(x_t - B\hat{y}_t^{t-1})\big\}^T\Big]$$
$$= E\Big[\big\{\tilde{y}_t^{t-1} - K_t(B\tilde{y}_t^{t-1} + v_t)\big\}\big\{\tilde{y}_{t-1}^{t-1} - J_{t-1}K_t(B\tilde{y}_t^{t-1} + v_t)\big\}^T\Big]$$

where we used the filter update and the smoother mean equation with $\tau = t$. Next we write this out and use that both $\tilde{y}_t^{t-1}$ and $\tilde{y}_{t-1}^{t-1}$ are independent of $v_t$, which is proved using Lemma 1 and the noise assumptions. This gives

$$P_{t,t-1}^t = P_{t,t-1}^{t-1} - P_t^{t-1}B^TK_t^TJ_{t-1}^T - K_tBP_{t,t-1}^{t-1} + K_t\big(BP_t^{t-1}B^T + R\big)K_t^TJ_{t-1}^T$$
$$= (I - K_tB)\,A\,P_{t-1}^{t-1}$$

where in the last line we used the definition of the Kalman gain (so that $K_t(BP_t^{t-1}B^T + R) = P_t^{t-1}B^T$) and $P_{t,t-1}^{t-1} = AP_{t-1}^{t-1}$, which is proved using the fact that $w_{t-1}$ is independent of $\tilde{y}_{t-1}^{t-1}$. If we set $t = \tau$ we find the initial condition

$$P_{\tau,\tau-1}^\tau = (I - K_\tau B)\,A\,P_{\tau-1}^{\tau-1}$$

The derivation of the backward recursion is somewhat elaborate. First we write, using the smoother mean equations,

$$\tilde{y}_{t-1}^\tau + J_{t-1}\hat{y}_t^\tau = \tilde{y}_{t-1}^{t-1} + J_{t-1}A\hat{y}_{t-1}^{t-1}$$
$$\tilde{y}_{t-2}^\tau + J_{t-2}\hat{y}_{t-1}^\tau = \tilde{y}_{t-2}^{t-2} + J_{t-2}A\hat{y}_{t-2}^{t-2}$$

Next we equate

$$E\Big[\big(\tilde{y}_{t-1}^\tau + J_{t-1}\hat{y}_t^\tau\big)\big(\tilde{y}_{t-2}^\tau + J_{t-2}\hat{y}_{t-1}^\tau\big)^T\Big] = E\Big[\big(\tilde{y}_{t-1}^{t-1} + J_{t-1}A\hat{y}_{t-1}^{t-1}\big)\big(\tilde{y}_{t-2}^{t-2} + J_{t-2}A\hat{y}_{t-2}^{t-2}\big)^T\Big]$$
$$\Rightarrow\quad P_{t-1,t-2}^\tau + J_{t-1}\,E[\hat{y}_t^\tau(\hat{y}_{t-1}^\tau)^T]\,J_{t-2}^T = E[\tilde{y}_{t-1}^{t-1}(\tilde{y}_{t-2}^{t-2})^T] + J_{t-1}A\,E[\hat{y}_{t-1}^{t-1}(\tilde{y}_{t-2}^{t-2})^T] + J_{t-1}A\,E[\hat{y}_{t-1}^{t-1}(\hat{y}_{t-2}^{t-2})^T]\,A^TJ_{t-2}^T$$

where Lemma 1 was used several times to get rid of cross terms. We will now rewrite some of the terms appearing above:

$$E[\hat{y}_t^\tau(\hat{y}_{t-1}^\tau)^T] = E[y_t y_{t-1}^T] - P_{t,t-1}^\tau$$
$$= E\big[(Ay_{t-1} + w_{t-1})(Ay_{t-2} + w_{t-2})^T\big] - P_{t,t-1}^\tau$$
$$= A\,E[y_{t-1}y_{t-2}^T]\,A^T + A\,E[y_{t-1}w_{t-2}^T] - P_{t,t-1}^\tau$$
$$= A\,E[y_{t-1}y_{t-2}^T]\,A^T + A\,E\big[(Ay_{t-2} + w_{t-2})w_{t-2}^T\big] - P_{t,t-1}^\tau$$
$$= A\,E[y_{t-1}y_{t-2}^T]\,A^T + AQ - P_{t,t-1}^\tau$$

Next,

$$E[\tilde{y}_{t-1}^{t-1}(\tilde{y}_{t-2}^{t-2})^T] = E\Big[\big\{\tilde{y}_{t-1}^{t-2} - K_{t-1}(x_{t-1} - B\hat{y}_{t-1}^{t-2})\big\}(\tilde{y}_{t-2}^{t-2})^T\Big]$$
$$= P_{t-1,t-2}^{t-2} - K_{t-1}\,E\big[(B\tilde{y}_{t-1}^{t-2} + v_{t-1})(\tilde{y}_{t-2}^{t-2})^T\big]$$
$$= (I - K_{t-1}B)\,P_{t-1,t-2}^{t-2}$$

where the filter update and the noise assumptions were used. Next,

$$E[\hat{y}_{t-1}^{t-1}(\tilde{y}_{t-2}^{t-2})^T] = E\Big[\big\{\hat{y}_{t-1}^{t-2} + K_{t-1}(B\tilde{y}_{t-1}^{t-2} + v_{t-1})\big\}(\tilde{y}_{t-2}^{t-2})^T\Big] = K_{t-1}BP_{t-1,t-2}^{t-2}$$

Finally we have

$$E[\hat{y}_{t-1}^{t-1}(\hat{y}_{t-2}^{t-2})^T] = E\Big[\big\{\hat{y}_{t-1}^{t-2} + K_{t-1}\tilde{x}_{t-1}^{t-2}\big\}(\hat{y}_{t-2}^{t-2})^T\Big]$$
$$= E[\hat{y}_{t-1}^{t-2}(\hat{y}_{t-2}^{t-2})^T]$$
$$= E[y_{t-1}y_{t-2}^T] - P_{t-1,t-2}^{t-2}$$

where the most important ingredients were Lemma 2 and the corollary of Lemma 1. Putting this together we have

$$P_{t-1,t-2}^\tau = (I - K_{t-1}B)\,P_{t-1,t-2}^{t-2}$$
$$\qquad +\; J_{t-1}\big(AK_{t-1}BP_{t-1,t-2}^{t-2} - AP_{t-1,t-2}^{t-2}A^TJ_{t-2}^T - AQJ_{t-2}^T\big)$$
$$\qquad +\; J_{t-1}P_{t,t-1}^\tau J_{t-2}^T$$

The first term can be rewritten as follows:

$$(I - K_{t-1}B)\,P_{t-1,t-2}^{t-2} = (I - K_{t-1}B)\,P_{t-1}^{t-2}\,(P_{t-1}^{t-2})^{-1}\,A\,P_{t-2}^{t-2} = P_{t-1}^{t-1}\,J_{t-2}^T$$

where we used $P_{t-1,t-2}^{t-2} = AP_{t-2}^{t-2}$, the filter covariance update and the definition of $J_{t-2}$. The term multiplied by $J_{t-1}$ can be rewritten as

$$AK_{t-1}BP_{t-1,t-2}^{t-2} - AP_{t-1,t-2}^{t-2}A^TJ_{t-2}^T - AQJ_{t-2}^T$$
$$= AK_{t-1}BP_{t-1,t-2}^{t-2} - A\big(P_{t-1,t-2}^{t-2}A^T + Q\big)J_{t-2}^T$$
$$= \big(AK_{t-1}BP_{t-1}^{t-2} - AP_{t-1}^{t-2}\big)J_{t-2}^T$$
$$= -A\,(I - K_{t-1}B)\,P_{t-1}^{t-2}\,J_{t-2}^T$$
$$= -A\,P_{t-1}^{t-1}\,J_{t-2}^T$$

where again $P_{t-1,t-2}^{t-2} = AP_{t-2}^{t-2} = P_{t-1}^{t-2}J_{t-2}^T$ and the filter covariance update were used, together with the fact that

$$P_{t-1}^{t-2} = P_{t-1,t-2}^{t-2}A^T + Q$$

which can be derived analogously to $P_{t,t-1}^{t-1} = AP_{t-1}^{t-1}$. Finally, putting this together, we derive the lag-one covariance smoother

$$P_{t-1,t-2}^\tau = P_{t-1}^{t-1}J_{t-2}^T + J_{t-1}\big(P_{t,t-1}^\tau - AP_{t-1}^{t-1}\big)J_{t-2}^T$$