Lecture 19 Kalman Filter
Introduction
• We observe (measure) economic data, {zt}, over time; but these measurements are noisy. There is an unobservable variable, yt, that drives the observations. We call yt the state variable.
• The Kalman filter (KF) uses the observed data to learn about the unobservable state variables, which describe the state of the model.
• KF models dynamically what we measure, zt, and the state, yt.
yt = g(yt-1, ut, wt)   (state or transition equation)
zt = f(yt, xt, vt)   (measurement equation)
ut, xt: exogenous variables. wt, vt: error terms.
Rudolf (Rudi) Emil Kalman (1930 – 2016, Hungary/USA)
Intuition: GDP and the State of the Economy
[Diagram: the state of the economy {yt} is an unobserved "black box" system, driven by external controls {ut} (monetary policy) and system error sources. Measuring devices (the Fed) produce the observed measurements {zt = GDPt}, corrupted by measurement error sources. An estimator combines the measurements into the optimal estimate of the system state, ŷt (desired, but unknown).]
• The business cycle (the system state), yt, cannot be measured directly.
• We measure the system at time t: GDP (zt).
• Need to estimate "optimally" the business cycle from GDP.
How does the KF work?
• It is a recursive data processing algorithm. As new information arrives, it updates predictions (it is a Bayesian algorithm).
• It generates optimal estimates of yt, given measurements {zt}.
• Optimal?
– For a linear system with white Gaussian errors, the KF is the "best" estimate based on all previous measurements (in the MSE sense).
– For a non-linear system, optimality is 'qualified' (say, "best linear").
• Recursive?
– No need to store all previous measurements and re-estimate the system as new information arrives.
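As an illustration of the recursive idea (not from the lecture), a minimal numpy sketch: estimating a constant state from noisy measurements with a running-mean update. Only the previous estimate is stored; the gain 1/t plays the role of the Kalman gain.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(5.0, 1.0, size=1000)   # noisy measurements of a constant state

# Recursive estimation: only the current estimate is stored,
# never the full measurement history.
est = 0.0
for t, zt in enumerate(z, start=1):
    gain = 1.0 / t                    # weight on the new information
    est = est + gain * (zt - est)     # estimate = old estimate + gain * innovation
```

This recursion reproduces the batch sample mean exactly, but never revisits old data – the same principle the KF applies to dynamic states.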
Notation
• For any vector st, we define the prediction of st at time t as: st|t-1 = E(st|It-1 )
That is, it is the best guess of st based on all the information available at time t-1, which we denote by It-1= {zt-1, ..., z1; ut-1, ..., u1; xt-1, ..., x1; ...}.
• As new information is released, we update our prediction:
st|t = E(st|It)
• The Kalman filter predicts zt|t-1 , yt|t-1 , and updates yt|t.
At time t, we define the prediction error: et|t-1 = zt – zt|t-1
The conditional variance of et|t-1: Ft|t-1 = E[et|t-1 et|t-1′]
Intuitive Example: Prediction and Updating
yt (true value of a company)
• Distribution of the true value, yt, is unobservable
• Assume Gaussian distributed measurements
[Figure: Gaussian density of the measurement at t1.]
- Observed measurement at t1: Mean = z1 and Variance = σ²z1
- Optimal estimate of true value: ŷ(t1) = z1
- Variance of error in estimate: σ²y(t1) = σ²z1
- Predicted value at time t2, using t1 info: ŷt2|t1 = z1
Intuitive Example: Prediction and Updating
• We have the prediction ŷt2|t1. At t2, we have a new measurement: prediction ŷt2|t1
[Figure: Gaussian densities of the prediction ŷt2|t1 and the new measurement z(t2).]
• New t2 measurement:
- Measurement at t2: Mean = z2 and Variance = σ²z2
- Update the prediction due to the new measurement: ŷt2|t2
• Closer to the more trusted measurement – linear interpolation?
Intuitive Example: Prediction and Updating
[Figure: prediction ŷt2|t1, measurement z(t2), and the corrected optimal estimate ŷt2|t2.]
• Corrected mean is the new optimal estimate of the true value
or Optimal (updated) estimate: ŷt|t = ŷt|t-1 + (Kalman Gain) × (zt – zt|t-1)
• New variance is smaller than either of the previous two variances,
or Variance of estimate = Variance of prediction × (1 – Kalman Gain)
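The blending above can be written out for two scalar Gaussians. A sketch with made-up numbers (y_pred, P_pred, z, R are illustrative, with H = 1):

```python
# Blend a Gaussian prediction (mean y_pred, variance P_pred)
# with a Gaussian measurement (z, noise variance R).
y_pred, P_pred = 10.0, 4.0           # prediction and its variance (illustrative)
z, R = 12.0, 1.0                     # measurement and its noise variance

K = P_pred / (P_pred + R)            # Kalman gain
y_upd = y_pred + K * (z - y_pred)    # corrected mean, pulled toward trusted z
P_upd = P_pred * (1 - K)             # updated variance

# The updated variance is smaller than either input variance.
assert P_upd < min(P_pred, R)
```

Here the measurement is the more trusted source (R < P_pred), so K = 0.8 and the corrected mean 11.6 lands much closer to z than to the prediction.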
Intuitive Example: Prediction and Updating
[Figure: ŷt2|t2 and the naïve prediction ŷt3|t2, shifted to the right.]
• At time t3, the true value changes at the rate dy/dt = u
• Naïve approach: Shift the probability to the right to predict
• This would work if we knew the rate of change (perfect model). But, this is unrealistic.
Intuitive Example: Prediction and Updating
[Figure: ŷt2|t2, the naïve prediction, and the actual prediction ŷt3|t2, which both shifts and spreads out.]
• Then, we assume an imperfect model by adding Gaussian noise
• dy/dt = u + w
• The distribution for the prediction moves and spreads out
Intuitive Example: Prediction and Updating
[Figure: prediction ŷt3|t2, measurement z(t3), and the updated optimal estimate ŷt3|t3.]
• Now we take a measurement at t3
• Need to once again correct the prediction
• Same as before
Intuition: Prediction and Updating
• From the intuitive example, we see a simple process:
– Initial conditions (ŷt-1 and σt-1)
– Prediction (ŷt|t-1, σt|t-1)
• Use initial conditions and the model (say, constant rate of change) to make a prediction
– Measurement (zt)
• Take the measurement, zt, and learn the forecast error, et|t-1 = zt – zt|t-1
– Updating (ŷt|t, σt|t)
• Use the measurement to update the prediction by 'blending' prediction and residual – always a case of merging only two Gaussians
• Optimal estimate with smaller variance
Terminology: Filtering and Smoothing
• Prediction is an a priori form of estimation. It attempts to provide information about what the quantity of interest will be at some time t+τ in the future by using data measured up to and including time t-1 (usually, KF refers to one-step ahead prediction).
• Filtering is an operation that involves the extraction of information about a quantity of interest at time t, by using data measured up to and including t.
• Smoothing is an a posteriori form of estimation. Data measured after the time of interest are used for the estimation. Specifically, the smoothed estimate at time t is obtained by using data measured over the interval [0, T], where t < T.
Bayesian Optimal Filter
• A Bayesian optimal filter computes the distribution:
P(yt | zt, zt-1, ..., z1) = P(yt | {zt})
• Given the following:
1. Prior distribution: P(y0)
2. State space model:
yt ~ P(yt | yt-1)
zt ~ P(zt | yt)
3. Measurement sequence: {zt}= {z1, z2, ..., zt}
• Computation is based on a recursion rule for incorporating the new measurement zt into the posterior:
P(yt-1 | {zt-1}) → P(yt | {zt})
Bayesian Optimal Filter: Prediction Step
• Assume we know the posterior distribution of the previous time step, t-1:
P(yt-1 | zt-1, zt-2, ..., z1)
• The joint pdf P(yt, yt-1|{zt-1}) can be computed as (using the Markov property):
P(yt, yt-1|{zt-1}) = P(yt|yt-1,{zt-1}) P(yt-1|{zt-1})
= P(yt|yt-1) P(yt-1|{zt-1})
• Integrating over yt-1 gives the Chapman-Kolmogorov equation:
P(yt | {zt-1}) = ∫ P(yt | yt-1) P(yt-1 | {zt-1}) dyt-1
• This is the prediction step of the optimal filter.
Bayesian Optimal Filter: Update Step
• Now we have:
1. Prior distribution from the Chapman-Kolmogorov equation:
P(yt | {zt-1})
2. Measurement likelihood:
P(zt | yt)
• The posterior distribution (= Likelihood × Prior):
P(yt | {zt}) ∝ P(zt | yt) × P(yt | {zt-1})
(ignoring the normalizing constant, P(zt | {zt-1}) = ∫ P(zt | yt) P(yt | {zt-1}) dyt).
• This is the update step of the optimal filter.
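The two steps can be checked numerically by brute force on a grid. A sketch (not from the lecture) for an assumed random-walk state yt = yt-1 + wt, wt ~ N(0, 0.5), with measurement zt = yt + vt, vt ~ N(0, 1):

```python
import numpy as np

# One prediction/update cycle of the Bayesian optimal filter on a grid.
grid = np.linspace(-10, 10, 401)
dy = grid[1] - grid[0]

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

post = gauss(grid, 0.0, 1.0)                       # P(y_{t-1} | {z_{t-1}}) = N(0, 1)

# Prediction step: Chapman-Kolmogorov integral over y_{t-1}
trans = gauss(grid[:, None], grid[None, :], 0.5)   # P(y_t | y_{t-1})
prior = trans @ post * dy                          # P(y_t | {z_{t-1}}), approx N(0, 1.5)

# Update step: likelihood x prior, renormalized
z = 2.0
post_new = gauss(z, grid, 1.0) * prior             # P(z_t | y_t) x P(y_t | {z_{t-1}})
post_new /= post_new.sum() * dy

mean = np.sum(grid * post_new) * dy
# Exact linear-Gaussian answer: prior N(0, 1.5), gain 1.5/2.5 = 0.6, mean = 1.2
```

In this linear-Gaussian case the grid filter matches the Kalman update; for non-Gaussian models the same two integrals still define the optimal filter, only the closed form is lost.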
State Space Form
• Model to be estimated:
yt = A yt-1 + B ut + wt,   wt: state noise ~ WN(0, Q)
zt = H yt + vt,   vt: measurement noise ~ WN(0, R)
ut: exogenous variable. A: state transition matrix. B: coefficient matrix for ut. H: measurement matrix.
Initial conditions: y0, usually a RV.
We call both equations state space form. Many economic models can be written in this form.
Note: The model is linear, with constant coefficient matrices, A, B, and H. It can be generalized – see Harvey (1989).
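A short simulation sketch of this model (scalar A, B, H and made-up parameter values, with ut set to zero):

```python
import numpy as np

# Simulate the linear state space model:
#   y_t = A y_{t-1} + B u_t + w_t,  w_t ~ N(0, Q)
#   z_t = H y_t + v_t,              v_t ~ N(0, R)
rng = np.random.default_rng(1)
A, B, H = 0.9, 1.0, 1.0
Q, R = 0.25, 1.0
T = 200

y = np.zeros(T)                      # unobserved state path
z = np.zeros(T)                      # observed measurements
u = np.zeros(T)                      # no exogenous input in this sketch
y_prev = 0.0
for t in range(T):
    y[t] = A * y_prev + B * u[t] + rng.normal(0, np.sqrt(Q))
    z[t] = H * y[t] + rng.normal(0, np.sqrt(R))
    y_prev = y[t]
# The state is persistent; the measurement adds extra noise on top of it.
```

This is the data-generating process the filter below will be run on: only z is observed, y is the target of estimation.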
State Space Form - Examples
Example: VAR(1)
Yt = A1 Yt-1 + εt
Transition equation:
[Y1,t; Y2,t] = [a11 a12; a21 a22] [Y1,t-1; Y2,t-1] + [1 0; 0 1] [w1,t; w2,t]
Measurement equation:
zt = [1 1] [Y1,t; Y2,t] + vt
State Space Form - Examples
• Example: VAR(p) written in state-space form.
Yt = Φ1 Yt-1 + Φ2 Yt-2 + ... + Φp Yt-p + at
Stack ξt = (Yt, Yt-1, ..., Yt-p+1)′ and define the companion matrix
F = [Φ1 Φ2 ... Φp-1 Φp; In 0 ... 0 0; 0 In ... 0 0; ... ; 0 0 ... In 0]
Then, we write the transition equation as: ξt = F ξt-1 + wt, with wt = (at, 0, ..., 0)′
and the measurement equation as: zt = [In 0 ... 0] ξt
State Space Form - Examples
• Example: Time-varying coefficients.
zt = αt + βt ut + vt,   vt ~ N(0, Vt)   (Measurement equation)
αt = αt-1 + wα,t,   wα,t ~ N(0, Wα,t)   (Transition equation)
βt = βt-1 + wβ,t,   wβ,t ~ N(0, Wβ,t)
The state of the system is yt = (αt, βt)′. Ht = [1, ut].
• Example: Stochastic volatility.
zt = H ht + C xt + vt,   vt ~ WN(0, R)   (Measurement equation)
ht = A ht-1 + B ut + wt,   wt ~ WN(0, Q)   (Transition equation)
State Space Form - Variations
• When {wt} is WN and independent of y0, {yt} is Markov. - Linearity is not essential for this property.
- {zt} is usually not Markov. zt is conditionally independent of the past given yt:
P(zt | zt-1, ..., z1, yt, ..., y1) = P(zt | yt)
• If, in addition, {wt, vt, y0} are jointly normal, the model is called Gauss-Markov Model.
• Usually, we assume that only the 1st and 2nd order statistics of {wt}, {vt} are known. Under the following assumptions, we have the Standard second-order model:
- {wt} moments: E[wt] = 0; Cov[wk, wl] = Qk δk,l
- {vt} moments: E[vt] = 0; Cov[vk, vl] = Rk δk,l
- Cov[wt, vt] = 0
- y0 independent of {wt} & {vt}, with E[y0] = ȳ0 & Cov(y0) = P0
Mean and Variance of State Vector yt
• yt is a random variable following: yt = A yt-1 + B ut + wt
- it may be unobservable, with no data for it.
- it is normally distributed (it is a sum of normal variables, wt ~ N(0, Q)):
P(yt|It-1) = N(E[yt|It-1], Var[yt|It-1])
• Conditional Mean: E[yt|It-1] = yt|t-1 = A yt-1|t-1 + B ut|t-1
In an AR(1) model: E[yt|It-1] = μ + ϕ yt-1|t-1.
• Conditional Variance: Var[yt|It-1] = Pt|t-1 = A Pt-1|t-1 A′ + Q
Note: There are 2 sources of noise: 1) wt & 2) yt-1 – the difference between yt-1 and yt-1|t-1 may not be zero.
• Cov(yt-1, wt) = 0.
Mean and Variance of zt & Joint (yt, zt)
• zt is a random variable following: zt = H yt + vt.
- it is normally distributed; it is a sum of normal variables, vt ~ N(0, R):
P(zt|It-1) = N(E[zt|It-1], Var[zt|It-1])
• Conditional Mean: E[zt|It-1] = zt|t-1 = H yt|t-1
• Conditional Variance: Var[zt|It-1] = H Pt|t-1 H′ + R
Note: Cov(yt-1, vt) = 0 (since E[wt vt′] = 0).
• Covariance between zt & yt: Cov[yt, zt|It-1] = Pt|t-1 H′
• Joint pdf P(yt, zt|It-1):
[yt | It-1; zt | It-1] ~ N( [A yt-1|t-1 + B ut; H yt|t-1], [Pt|t-1, Pt|t-1 H′; H Pt|t-1, H Pt|t-1 H′ + R] )
Kalman Filter
• Given y0|0 & P0|0 (initialization), the Kalman Filter solves the following equations for t = 1, ..., T
Prediction: yt|t-1 is the estimate based on measurements up to the previous time, t-1:
yt|t-1 = A yt-1|t-1 + B ut
Pt|t-1 = A Pt-1|t-1 A′ + Q
Update: yt|t has additional information – the measurement at time t:
yt|t = yt|t-1 + Kt (zt – H yt|t-1)
Pt|t = Pt|t-1 – Kt H Pt|t-1 = (I – Kt H) Pt|t-1
Kt = Pt|t-1 H′ (H Pt|t-1 H′ + R)^-1   ("Kalman gain")
• The forecast error is: et|t-1 = zt – zt|t-1 = zt – H yt|t-1
• The variance of et|t-1: Ft|t-1 = H Pt|t-1 H′ + R
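The recursion above, written out as a minimal scalar sketch (B = 0; all parameter values are made up):

```python
import numpy as np

def kalman_filter(z, A, H, Q, R, y0, P0):
    """Scalar KF: y_t = A y_{t-1} + w_t, z_t = H y_t + v_t."""
    y, P = y0, P0
    filtered = []
    for zt in z:
        # Prediction
        y_pred = A * y
        P_pred = A * P * A + Q
        # Update
        F = H * P_pred * H + R                 # innovation variance F_{t|t-1}
        K = P_pred * H / F                     # Kalman gain
        y = y_pred + K * (zt - H * y_pred)
        P = (1 - K * H) * P_pred
        filtered.append(y)
    return np.array(filtered)

# Simulate a state path and noisy measurements, then filter.
rng = np.random.default_rng(2)
A, H, Q, R = 0.9, 1.0, 0.25, 1.0
T = 300
y_true = np.zeros(T); z = np.zeros(T)
prev = 0.0
for t in range(T):
    prev = A * prev + rng.normal(0, np.sqrt(Q))
    y_true[t] = prev
    z[t] = H * prev + rng.normal(0, np.sqrt(R))

y_filt = kalman_filter(z, A, H, Q, R, y0=0.0, P0=1.0)
# The filtered estimates track the state better than the raw measurements do.
assert np.mean((y_filt - y_true) ** 2) < np.mean((z - y_true) ** 2)
```

The same code generalizes to the matrix case by replacing products with matrix products and the division by a matrix inverse.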
Kalman Filter
• Recall: et|t-1 = zt – zt|t-1 = zt – H yt|t-1 (forecast error)
Ft|t-1 = H Pt|t-1 H′ + R (variance of et|t-1)
• The initial values, y0|0 & P0|0, are set to unconditional mean and variance, and reflect prior beliefs about the distribution of yt.
• The update of the state variable, yt|t , and its variance, Pt|t are linear combinations of previous guess and forecast error:
yt|t = yt|t-1 + Kt (zt – H yt|t-1) = yt|t-1 + Kt et|t-1
Pt|t = Pt|t-1 – Kt H Pt|t-1 (conditional variance)
Kt = Pt|t-1 H′ (H Pt|t-1 H′ + R)^-1 = Cov[yt, zt|It-1] (Ft|t-1)^-1
Since we observe zt, the uncertainty (measured by Pt|t) declines.
Note: The bigger Ft|t-1, the smaller Kt and the less weight put on updating.
Kalman Gain
• Kt depends on the relationship between (zt & yt) and Ft|t-1:
Kt = Pt|t-1 H′ (H Pt|t-1 H′ + R)^-1 = Cov[yt, zt|It-1] (Ft|t-1)^-1
- The stronger Cov[zt, yt|It-1], the more relevant Kt is in the update.
- If the relationship is weaker, we do not put much weight on the innovation, as it is likely not driven by yt.
- If we are sure about measurements, R decreases to zero; Kt increases and we weight residuals more heavily than the prediction.
• If the model is time-invariant (Q and R are, in fact, constant) the
Kalman gain quickly converges to a constant: Kt → K. In this case, the filter becomes stationary.
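A quick numerical check of this convergence (same scalar notation as above; parameter values are made up):

```python
# Iterate the gain/variance recursion with time-invariant A, H, Q, R,
# starting from a deliberately far-off initial variance.
A, H, Q, R = 0.9, 1.0, 0.25, 1.0
P = 10.0
gains = []
for _ in range(50):
    P_pred = A * P * A + Q
    K = P_pred * H / (H * P_pred * H + R)
    P = (1 - K * H) * P_pred
    gains.append(K)

# After a few iterations K_t settles at the steady-state gain K.
assert abs(gains[-1] - gains[-2]) < 1e-10
```

The steady-state P_pred solves the (scalar) Riccati equation; once K has converged, the filter weights are the same every period, i.e., the filter is stationary.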
MLE
• Once we have et|t-1 & Ft|t-1, under normality, we can do MLE:
ln f(zT, ..., zp+1 | zp, ..., z1) = –(Tn/2) ln(2π) – (1/2) Σt=p+1..T ln|Ft| – (1/2) Σt=p+1..T et′ Ft^-1 et
Note: Under normality, the KF is optimal in an MSE sense.
• Algorithm (after initialization & ignoring the first observations):
for (i in 1:T) {
  zhat = t(H) %*% state10                             # predicted measurement
  zvar = t(H) %*% P10 %*% H + R                       # innovation variance F
  zvarinv = inv(zvar)
  eps = t(z[i,.]) - zhat                              # forecast error e
  f0 = f0 - ln(det(zvar)) - t(eps) %*% zvarinv %*% eps
  state11 = state10 + P10 %*% H %*% zvarinv %*% eps   # update: y_{t|t}
  P11 = P10 - P10 %*% H %*% zvarinv %*% t(H) %*% P10  # update: P_{t|t}
  state10 = A %*% state11                             # predict: y_{t+1|t}
  P10 = A %*% P11 %*% t(A) + Q                        # predict: P_{t+1|t}
}
f0 = -(T*n/2) * ln(2*pi) + f0/2
Derivation of Update
• We wrote the joint of (zt, yt)|It-1. But, we could have also written the joint of (et, yt)|It-1:
[yt | It-1; et | It-1] ~ N( [yt|t-1; 0], [Pt|t-1, Pt|t-1 H′; H Pt|t-1, Ft|t-1] )
• Recall a property of the multivariate normal distribution:
If x1 and x2 are jointly normally distributed, the conditional distribution of x1|x2 is also normal, with mean μ1|2 and variance Σ1|2:
μ1|2 = μ1 + Σ12 Σ22^-1 (x2 – μ2)
Σ1|2 = Σ11 – Σ12 Σ22^-1 Σ21
Then,
yt|t = yt|t-1 + Pt|t-1 H′ (Ft|t-1)^-1 et|t-1
Pt|t = Pt|t-1 – Pt|t-1 H′ (Ft|t-1)^-1 H Pt|t-1
Feedback
Prediction (Time Update):
(1) Project the state ahead: yt|t-1 = A yt-1|t-1 + B ut
(2) Project the error covariance ahead: Pt|t-1 = A Pt-1|t-1 A′ + Q
Correction (Measurement Update):
(1) Compute the Kalman gain: Kt = Pt|t-1 H′ (H Pt|t-1 H′ + R)^-1
(2) Update estimate with measurement zt: yt|t = yt|t-1 + Kt (zt – H ŷt|t-1)
(3) Update the error covariance: Pt|t = (I – Kt H) Pt|t-1
The two steps feed back into each other each period.
Kalman Filter: Remarks
• KF works by minimizing E[(yt – yt|t-1)′ (yt – yt|t-1)]. Under this metric, the conditional expected value is the optimal estimator.
• Conditions for optimality (MSE) –Anderson and Moore (1979): - the DGP (linear system & its state space model) is exactly known.
- noise vector (wt, vt) is white noise. - the noise covariances are known.
• KF is the best (minimum MSE) estimator among all linear estimators; in the case of a Gaussian model, it is the best among all estimators.
• In practice, it is difficult to meet the three conditions. A lot of tuning by ad-hoc methods is needed to get a KF that works "sufficiently well."
Kalman Filter: Practical Considerations
• Estimating Q and R usually involves NW-style SE, based on autocovariances. Estimating Q is, in general, complicated. Tuning Q and R (sometimes called system identification) is usual practice to improve performance.
• The inversion of Ft|t-1 can be difficult. Usually, the problem comes from having Q singular. In practice, approximations (pseudo-inversion) are used.
• When knowledge/confidence about y0 is low, a diffuse prior for y0 would set P0|0 high.
• The noise vector (wt, vt) should be WN. Autocorrelograms and LB tests can be used to check this.
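A sketch of such a whiteness check using the sample autocorrelogram and the Ljung-Box statistic (numpy only; the input here is genuine white noise standing in for the innovations):

```python
import numpy as np

rng = np.random.default_rng(3)
e = rng.normal(size=500)           # stand-in for the KF innovations e_{t|t-1}
T, m = len(e), 10                  # sample size and number of lags

# Sample autocorrelations at lags 1..m, then the Ljung-Box statistic.
e = e - e.mean()
acf = np.array([np.sum(e[k:] * e[:-k]) for k in range(1, m + 1)]) / np.sum(e**2)
LB = T * (T + 2) * np.sum(acf**2 / (T - np.arange(1, m + 1)))

# Under H0 (white noise), LB is approximately chi-squared with m df;
# the 1% critical value for 10 df is about 23.2, so whiteness is not
# rejected whenever LB < 23.2.
```

If the filter's innovations fail this check, the state space model (A, H, Q, or R) is misspecified and should be revisited.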
Kalman Filter: Problems and Variations
• When the model – i.e., g(.) and f(.) – is non-linear, the extended Kalman filter (EKF) works by linearizing the model (similar to NLLS; A & H are Jacobian matrices of partial derivatives).
Problem: The distributions of the RVs are no longer normal after their respective nonlinear transformations. The EKF is an ad-hoc method.
• When the model is highly non-linear, the EKF will not work well. The unscented Kalman filter (UKF), which propagates a deterministic sample of "sigma points" through the nonlinearities to calculate the updates, works better.
• The KF struggles under non-normality and when the dimension of the state vector increases. Particle filters (sequential Monte Carlo), another class of Bayesian filters, are very popular in these cases.
Kalman Filter: Smoothing
• We can use all the data to re-estimate our prediction. That is, yt|T.
• Inputs: Initial distribution y0 and data z1, …, zT
• Algorithm: Forward-backward pass (Rauch-Tung-Striebel algorithm)
- Forward pass:
Kalman filter: compute yt+1|t and yt+1|t+1 for 0 ≤ t < T
- Backward pass:
Compute yt|T for 0 ≤ t < T
This reverses the process in our intuitive example.
Kalman Filter: Smoothing – Backward Pass
• Compute yt|T given yt+1|T ~ N(yt+1|T, Pt+1|T )
- Reverse movement from the filter: yt|t → yt+1|t
- Same as incorporating a measurement in the filter:
1. Compute the joint pdf of (yt|t, yt+1|t)
2. Compute the conditional distribution (yt|t | yt+1|t = yt+1)
- But: yt+1 is not "known"; we only know its distribution:
3. "Uncondition" on yt+1 to compute yt|T using the laws of total expectation and variance
Kalman Filter: Smoothing – Backward Pass
• Step 1. Compute the joint pdf of (yt|t, yt+1|t):
[yt|t; yt+1|t] ~ N( [yt|t; yt+1|t], [Var(yt|t), Cov(yt|t, yt+1|t); Cov(yt+1|t, yt|t), Var(yt+1|t)] )
              = N( [yt|t; yt+1|t], [Pt|t, Pt|t A′; A Pt|t, Pt+1|t] )
• Step 2. Compute the conditional pdf of (yt|t | yt+1|t = yt+1):
yt|t | (yt+1|t = yt+1) ~ N( yt|t + Pt|t A′ (Pt+1|t)^-1 (yt+1 – yt+1|t), Pt|t – Pt|t A′ (Pt+1|t)^-1 A Pt|t )
where we used the conditional normal result:
μ1|2 = μ1 + Σ12 Σ22^-1 (x2 – μ2)
Σ1|2 = Σ11 – Σ12 Σ22^-1 Σ21
Kalman Filter: Smoothing – Backward Pass
• Step 3. "Uncondition" on yt+1 to compute yt|T. We do not know its value, only its distribution: yt+1 ~ N(yt+1|T, Pt+1|T).
• Use the Law of total expectation and the Law of total variance:
Law of total expectation: E[X] = E_Z[E(X|Y = Z)]
E[yt|T] = E_{yt+1|T}[E(yt|t | yt+1|t = yt+1|T)]
Law of total variance: Var(X) = E_Z[Var(X|Y = Z)] + Var_Z(E[X|Y = Z])
Var(yt|T) = E_{yt+1|T}[Var(yt|t | yt+1|t = yt+1|T)] + Var_{yt+1|T}(E[yt|t | yt+1|t = yt+1|T])
Kalman Filter: Smoothing – Backward Pass
• Step 3 (continuation). From Step 2 we know (with Lt = Pt|t A′ (Pt+1|t)^-1):
E(yt|t | yt+1|t = yt+1|T) = yt|t + Lt (yt+1|T – yt+1|t)
Var(yt|t | yt+1|t = yt+1|T) = Pt|t – Lt Pt+1|t Lt′
Then,
E(yt|T) = E_{yt+1|T}[E(yt|t | yt+1|t = yt+1|T)] = yt|t + Lt (yt+1|T – yt+1|t)
Var(yt|T) = E_{yt+1|T}[Var(yt|t | yt+1|t = yt+1|T)] + Var_{yt+1|T}(E[yt|t | yt+1|t = yt+1|T])
          = Pt|t – Lt Pt+1|t Lt′ + Lt Pt+1|T Lt′
          = Pt|t + Lt (Pt+1|T – Pt+1|t) Lt′
Kalman Filter: Smoothing – Backward Pass
• Summary:
Lt = Pt|t A′ (Pt+1|t)^-1
yt|T = yt|t + Lt (yt+1|T – yt+1|t)
Pt|T = Pt|t + Lt (Pt+1|T – Pt+1|t) Lt′
• Algorithm:
for (i in 1:(T-1)) {
  P1T = reshape(Psmo[T-i+1,.], rx, rx)          # P_{t+1|T}
  state1T = t(statesmo[T-i+1,.])                # y_{t+1|T}
  Pfilt = reshape(P[T-i,.], rx, rx)             # this is P_{t|t}
  Pnext = A %*% Pfilt %*% t(A) + Q              # this is P_{t+1|t}
  Lt = Pfilt %*% t(A) %*% inv(Pnext)
  P0T = Pfilt + Lt %*% (P1T - Pnext) %*% t(Lt)  # P_{t|T}
  state0T = t(state[T-i,.]) + Lt %*% (state1T - A %*% t(state[T-i,.]))
  statesmo[T-i,.] = t(state0T)
  Psmo[T-i,.] = t(vec(P0T))
}
Kalman Filter: Smoothing – Algorithm
• for (t = 0; t < T; ++t) (Kalman filter)
yt+1|t = A yt|t
Pt+1|t = A Pt|t A′ + Q
Kt+1 = Pt+1|t H′ (H Pt+1|t H′ + R)^-1
yt+1|t+1 = yt+1|t + Kt+1 (zt+1 – H yt+1|t)
Pt+1|t+1 = Pt+1|t – Kt+1 H Pt+1|t
• for (t = T – 1; t ≥ 0; --t) (Backward pass)
Lt = Pt|t A′ (Pt+1|t)^-1
yt|T = yt|t + Lt (yt+1|T – yt+1|t)
Pt|T = Pt|t + Lt (Pt+1|T – Pt+1|t) Lt′
Kalman Filter: Smoothing - Remarks
• Kalman smoother is a post-processing method.
• Use the yt|T's as the optimal estimate of the state at time t, and use Pt|T as a measure of uncertainty.
• The smoothing recursion consists of the backward recursion that uses the filtered values of y and P.