
Lecture 19


Introduction

• We observe (measure) economic data, {zt}, over time, but these measurements are noisy. There is an unobservable variable, yt, that drives the observations. We call yt the state.

• The Kalman filter (KF) uses the observed data to learn about the unobservable state variables, which describe the state of the model.

• KF models dynamically what we measure, zt, and the state, yt.

yt = g(yt-1, ut, wt)   (state or transition equation)

zt = f(yt, xt, vt)   (measurement equation)

ut, xt: exogenous variables. wt, vt: error terms.

Rudolf (Rudi) Emil Kalman (1930 – 2016, Hungary/USA)

Intuition: GDP and the State of the Economy

[Block diagram: a black-box system, driven by external controls {ut} (monetary policy) and system error sources, produces the unobserved system state {yt} (state of the economy); measuring devices (the Fed), subject to measurement error sources, yield observed measurements {zt = GDPt}; the filter produces an optimal estimate ŷt of the (desired, but unknown) system state.]

• The business cycle (the system state), yt, cannot be measured directly.
• We measure the system at time t: GDP (zt).
• Need to estimate "optimally" the business cycle from GDP.

How does the KF work?

• It is a recursive data processing algorithm. As new information arrives, it updates its estimates (it is a Bayesian algorithm).

• It generates optimal estimates of yt, given measurements {zt}.

• Optimal?
– For a linear system with white Gaussian errors, the KF gives the "best" estimate based on all previous measurements (in the MSE sense).
– For a non-linear system, optimality is "qualified" (say, "best linear").

• Recursive?
– No need to store all previous measurements and re-estimate the system as new information arrives.
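The recursive idea can be illustrated outside the KF: a running mean can fold in each new observation without storing the history, which is the same property the Kalman filter exploits (a minimal sketch; the measurements are made up, and this is not the KF itself):

```python
# Recursive estimation sketch: the sample mean can be updated one
# observation at a time, without storing past data.

def update_mean(prev_mean, n, z):
    """Fold a new measurement z into the mean of the first n observations."""
    return prev_mean + (z - prev_mean) / (n + 1)

measurements = [2.0, 4.0, 6.0, 8.0]
est, n = 0.0, 0
for z in measurements:
    est = update_mean(est, n, z)
    n += 1

print(est)  # same as sum(measurements)/len(measurements) = 5.0
```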

Notation

• For any vector st, we define the prediction of st at time t as: st|t-1 = E(st|It-1)

That is, it is the best guess of st based on all the information available at time t-1, which we denote by It-1= {zt-1, ..., z1; ut-1, ..., u1; xt-1, ..., x1; ...}.

• As new information is released, we update our prediction:

st|t = E(st|It)

• The Kalman filter predicts zt|t-1 , yt|t-1 , and updates yt|t.

At time t, we define a prediction error: et|t-1 = zt -zt|t-1

The conditional variance of et|t-1: Ft|t-1 = E[et|t-1 et|t-1′]

Intuitive Example: Prediction and Updating

yt (true value of a company)

• Distribution of the true value, yt, is unobservable.
• Assume Gaussian distributed measurements.


- Observed measurement at t1: Mean = z1 and Variance = σ²z1
- Optimal estimate of true value: ŷ(t1) = z1
- Variance of error in estimate: σ²y(t1) = σ²z1
- Predicted value at time t2, using t1 info: ŷt2|t1 = z1

Intuitive Example: Prediction and Updating

• We have the prediction ŷt2|t1. At t2, we have a new measurement, z(t2).


• New t2 measurement:

- Measurement at t2: Mean = z2 and Variance = σ²z2
- Update the prediction due to the new measurement: ŷt2|t2

• Closer to the more trusted measurement – linear interpolation?

Intuitive Example: Prediction and Updating

[Figure: the corrected optimal estimate ŷt2|t2 lies between the prediction ŷt2|t1 and the measurement z(t2), with a tighter density than either.]

• Corrected mean is the new optimal estimate of the true value

or Optimal (updated) estimate: ŷt|t = ŷt|t-1 + (Kalman gain) * (zt – zt|t-1)

• New variance is smaller than either of the previous two

or Variance of estimate = Variance of prediction * (1 – Kalman gain)
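This blending of a Gaussian prediction with a Gaussian measurement can be sketched numerically (a scalar illustration with H = 1; the means and variances below are made up, not from the lecture):

```python
# Scalar sketch of the update step: blend a Gaussian prediction with a
# Gaussian measurement via the Kalman gain.

def kalman_update(y_pred, p_pred, z, r):
    """Blend prediction N(y_pred, p_pred) with measurement N(z, r)."""
    k = p_pred / (p_pred + r)          # Kalman gain (scalar, H = 1)
    y_upd = y_pred + k * (z - y_pred)  # corrected mean
    p_upd = (1 - k) * p_pred           # corrected variance
    return y_upd, p_upd

y, p = kalman_update(y_pred=10.0, p_pred=4.0, z=12.0, r=1.0)
print(y, p)  # mean pulled toward the more precise measurement; p < min(4, 1)
```

Note that the updated variance is smaller than both input variances, exactly as the slide states.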

Intuitive Example: Prediction and Updating

[Figure: the t2 estimate ŷt2|t2 shifted to the right to form the naïve prediction ŷt3|t2.]

• At time t3, the true value changes at the rate dy/dt = u.
• Naïve approach: Shift the probability to the right to predict.
• This would work if we knew the rate of change (perfect model). But this is unrealistic.

Intuitive Example: Prediction and Updating

[Figure: relative to the naïve prediction, the prediction ŷt3|t2 both shifts and spreads out.]

• Then, we assume an imperfect model by adding Gaussian noise: dy/dt = u + w.
• The distribution for the prediction moves and spreads out.

Intuitive Example: Prediction and Updating

[Figure: the measurement z(t3) and the prediction ŷt3|t2 combine into the updated optimal estimate ŷt3|t3.]

• Now we take a measurement at t3.
• Need to once again correct the prediction.
• Same as before.

Intuition: Prediction and Updating

• From the intuitive example, we see a simple process:

– Initial conditions (ŷt-1 and σt-1)

– Prediction (ŷt|t-1, σt|t-1)
  • Use initial conditions and model (say, constant rate of change) to make a prediction.

– Measurement (zt)
  • Take measurement zt and learn the forecast error, et|t-1 = zt - zt|t-1.

– Updating (ŷt|t, σt|t)
  • Use the measurement to update the prediction by 'blending' prediction and residual – always a case of merging only two Gaussians.
  • Optimal estimate with smaller variance.
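The full cycle above can be sketched for a scalar model with a constant rate of change, y_t = y_{t-1} + u + w_t and z_t = y_t + v_t (all parameter values and measurements below are made up for illustration):

```python
# One prediction + update cycle of a scalar Kalman filter, repeated over a
# short made-up sample. With q = 0.5 and r = 1, the filtered variance
# converges to its steady-state value of 0.5.

def kalman_step(y, p, z, u=1.0, q=0.5, r=1.0):
    """One prediction + update cycle of a scalar Kalman filter."""
    # Prediction: push the estimate through the model; variance grows by q.
    y_pred = y + u
    p_pred = p + q
    # Update: blend prediction and measurement via the Kalman gain.
    k = p_pred / (p_pred + r)
    y_new = y_pred + k * (z - y_pred)
    p_new = (1 - k) * p_pred
    return y_new, p_new

y, p = 0.0, 1.0                      # initial conditions (ŷ0, σ0²)
for z in [1.2, 1.9, 3.1, 4.0]:       # made-up measurements
    y, p = kalman_step(y, p, z)
print(round(y, 3), round(p, 3))
```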

Terminology: Prediction, Filtering and Smoothing

• Prediction is an a priori form of estimation. It attempts to provide information about what the quantity of interest will be at some time t+τ in the future by using data measured up to and including time t-1 (usually, KF refers to one-step ahead prediction).

• Filtering is an operation that involves the extraction of information about a quantity of interest at time t, by using data measured up to and including t.

• Smoothing is an a posteriori form of estimation. Data measured after the time of interest are used for the estimation. Specifically, the smoothed estimate at time t is obtained by using data measured over the interval [0, T], where t < T.

Bayesian Optimal Filter

• A Bayesian optimal filter computes the distribution

P(yt|zt, zt-1, ..., z1) = P(yt|{zt})

• Given the following:

1. Prior distribution: P(y0)
2. State space model:

yt ~ P(yt|yt-1)

zt|t ~P(zt |yt)

3. Measurement sequence: {zt}= {z1, z2, ..., zt}

• Computation is based on a recursion rule for incorporating the new measurement zt into the posterior:

P(yt-1|zt-1, ..., z1) → P(yt|zt, ..., z1)

Bayesian Optimal Filter: Prediction Step

• Assume we know the posterior distribution of the previous time step, t-1:

P(yt-1|zt-1, ..., z1)

• The joint pdf P(yt, yt-1|{zt-1}) can be computed as (using the Markov property):

P(yt, yt-1|{zt-1}) = P(yt|yt-1,{zt-1}) P(yt-1|{zt-1})

= P(yt|yt-1) P(yt-1|{zt-1})

• Integrating over yt-1 gives the Chapman-Kolmogorov equation: P(yt|{zt-1}) = ∫ P(yt|yt-1) P(yt-1|{zt-1}) dyt-1

• This is the prediction step of the optimal filter.


Bayesian Optimal Filter: Update Step

• Now we have:
1. Prior distribution from the Chapman-Kolmogorov equation:
P(yt|{zt-1})
2. Measurement likelihood:
P(zt|yt)

• The posterior distribution (= Likelihood x Prior):

P(yt|{zt}) ∝ P(zt|yt) x P(yt|{zt-1})

(ignoring the normalizing constant, P(zt|{zt-1}) = ∫ P(zt|yt) P(yt|{zt-1}) dyt).

• This is the update step of the optimal filter.
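The prediction (Chapman-Kolmogorov) and update (Bayes) steps can be sketched by brute force on a discretized state grid. Below is an illustrative Gaussian random-walk model with made-up noise variances and measurements; the grid integral replaces the ∫ in both steps:

```python
# Brute-force Bayesian filter on a state grid (illustrative sketch).
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

grid = [i * 0.1 for i in range(-100, 101)]   # state grid on [-10, 10]
dy = 0.1
post = [gauss(y, 0.0, 1.0) for y in grid]    # prior P(y0) = N(0, 1)

def predict(post, q=0.5):
    """Chapman-Kolmogorov: P(yt|{zt-1}) = ∫ P(yt|yt-1) P(yt-1|{zt-1}) dyt-1."""
    return [sum(gauss(y, yk, q) * pk for yk, pk in zip(grid, post)) * dy
            for y in grid]

def update(prior, z, r=1.0):
    """Bayes: posterior ∝ likelihood x prior, then normalize on the grid."""
    unnorm = [gauss(z, y, r) * p for y, p in zip(grid, prior)]
    c = sum(unnorm) * dy                     # normalizing constant P(zt|{zt-1})
    return [u / c for u in unnorm]

for z in [1.0, 1.5]:                         # made-up measurements
    post = update(predict(post), z)

mean = sum(y * p for y, p in zip(grid, post)) * dy
print(round(mean, 3))
```

Because this model is linear-Gaussian, the grid filter's posterior mean matches the closed-form Kalman recursion; the grid version, however, works for any transition and likelihood densities.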


State Space Form

• Model to be estimated:

yt = A yt-1 + B ut + wt,   wt: state noise ~ WN(0,Q)   (transition equation)

zt = H yt + vt,   vt: measurement noise ~ WN(0,R)   (measurement equation)

ut: exogenous variable. A: state transition matrix. B: coefficient matrix for ut. H: measurement matrix.

Initial conditions: y0, usually a RV.

We call both equations state space form. Many economic models can be written in this form.

Note: The model is linear, with constant coefficient matrices, A, B, and H. It can be generalized – see Harvey (1989).

State Space Form - Examples

• Example: VAR(1)

Yt = A1 Yt-1 + εt

Transition equation:
[Y1,t]   [a11  a12] [Y1,t-1]   [1  0] [w1,t]
[Y2,t] = [a21  a22] [Y2,t-1] + [0  1] [w2,t]

Measurement equation:
zt = H [Y1,t, Y2,t]' + vt


State Space Form - Examples

• Example: VAR(p) written in state-space form.

Yt = Φ1 Yt-1 + Φ2 Yt-2 + ... + Φp Yt-p + at

Stack the state as ξt = (Yt, Yt-1, ..., Yt-p+1)' and define the companion matrix

    [Φ1  Φ2  ...  Φp-1  Φp]
    [In  0   ...  0     0 ]
F = [0   In  ...  0     0 ]
    [............        ]
    [0   0   ...  In    0 ]

with wt = (at, 0, ..., 0)'.

Then, we write the transition equation as: ξt = F ξt-1 + wt

and the measurement equation as: zt = ξt

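The companion-form trick can be checked numerically: for a scalar AR(2) with made-up coefficients (and no noise, to keep the check exact), iterating the first-order system ξt = F ξt-1 reproduces the direct AR(2) recursion:

```python
# Companion-form sketch: write y_t = phi1 y_{t-1} + phi2 y_{t-2} as a
# first-order system xi_t = F xi_{t-1}, with xi_t = (y_t, y_{t-1})'.

phi1, phi2 = 0.5, 0.3
F = [[phi1, phi2],
     [1.0,  0.0]]   # companion matrix: second row just shifts y_{t-1} down

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

# Direct AR(2) recursion
y = [1.0, 2.0]                       # y_0, y_1 (made-up starting values)
for t in range(2, 6):
    y.append(phi1 * y[t - 1] + phi2 * y[t - 2])

# Companion-form recursion from the same starting point
xi = [y[1], y[0]]                    # state at t = 1
for t in range(2, 6):
    xi = matvec(F, xi)

print(xi[0], y[5])  # the two recursions agree
```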

State Space Form - Examples

• Example: Time-varying coefficients.

zt = αt + βt ut + vt,   vt ~ N(0, Vt)    (Measurement equation)

αt = αt-1 + wα,t,   wα,t ~ N(0, Wα,t)    (Transition equation)
βt = βt-1 + wβ,t,   wβ,t ~ N(0, Wβ,t)

The state of the system is yt = (αt, βt)'. Ht = [1, ut].

• Example: Stochastic volatility.

zt = H ht + C xt + vt vt~WN(0, R) (Measurement equation) ht = A ht-1 + B ut + wt wt~WN(0, Q) (Transition equation)


State Space Form - Variations

• When {wt} is WN and independent of y0, {yt} is Markov. - Linearity is not essential for this property.

-{zt} is usually not Markov. zt is conditionally independent given yt: P(zt|zt-1, ..., z1, yt, ..., y1 ) = P(zt|yt)

• If, in addition, {wt, vt, y0} are jointly normal, the model is called Gauss-Markov Model.

• Usually, we assume that only the 1st and 2nd order moments of {wt}, {vt} are known. Under the following assumptions, we have the standard second-order model:

- {wt} moments: E[wt] = 0; Cov[wk, wl] = Qk δk,l
- {vt} moments: E[vt] = 0; Cov[vk, vl] = Rk δk,l
- Cov[wt, vt] = 0
- y0 independent of {wt} & {vt}, with E[y0] = ȳ0 and Cov(y0) = P0.

Mean and Variance of State Vector yt

• yt is a random variable following: yt = Ayt-1 + But + wt
- It may be unobservable, with no data for it.
- It is normally distributed (it is a sum of normal variables, wt ~ N(0,Q)):
P(yt|It-1) = N(E[yt|It-1], Var[yt|It-1])

• Conditional Mean: E[yt|It-1] = yt|t-1 = A yt-1|t-1 + B ut|t-1
In an AR(1) model: E[yt|It-1] = μ + ϕ yt-1|t-1.

• Conditional Variance: Var[yt|It-1] = Pt|t-1 = A Pt-1|t-1 A' + Q

Note: There are 2 sources of noise: 1) wt & 2) yt-1 – the difference between yt-1 and yt-1|t-1 may not be zero.

• Cov(yt-1, wt) = 0.

Mean and Variance of zt & Joint (yt, zt)

• zt is a random variable following: zt = Hyt + vt.
- It is normally distributed (it is a sum of normal variables, vt ~ N(0,R)):
P(zt|It-1) = N(E[zt|It-1], Var[zt|It-1])

• Conditional Mean: E[zt|It-1] = zt|t-1 = H yt|t-1

• Conditional Variance: Var[zt|It-1] = H Pt|t-1 H' + R
Note: Cov(yt-1, vt) = 0 (since E[wt vt'] = 0).

• Covariance between zt & yt: Cov[zt, yt|It-1] = Pt|t-1 H'

• Joint pdf of (zt, yt|It-1):

(yt|It-1)     ( (A yt-1|t-1 + B ut)   ( Pt|t-1     Pt|t-1 H'        ) )
(zt|It-1) ~ N ( (H yt|t-1)         ,  ( H Pt|t-1   H Pt|t-1 H' + R  ) )

Kalman Filter

• Given y0|0 & P0|0 (initialization), the Kalman filter solves the following equations for t = 1, ..., T.

Prediction: yt|t-1 is the estimate based on measurements up to the previous time, t-1:
yt|t-1 = A yt-1|t-1 + B ut
Pt|t-1 = A Pt-1|t-1 A' + Q

Update: yt|t has additional information – the measurement at time t:
yt|t = yt|t-1 + Kt (zt - H yt|t-1)
Pt|t = Pt|t-1 - Kt H Pt|t-1 = (I - Kt H) Pt|t-1
Kt = Pt|t-1 H' (H Pt|t-1 H' + R)^-1   ("Kalman gain")

• The forecast error is: et|t-1 = zt - zt|t-1 = zt - H yt|t-1

• The variance of et|t-1: Ft|t-1 = H Pt|t-1 H' + R
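These matrix recursions can be sketched directly in Python with plain lists (a 2-state example; A, H, Q, R and the measurements are all made up, and F here is 1x1 so the inversion is a scalar division):

```python
# Minimal matrix implementation of the prediction/update equations above.

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(r) for r in zip(*a)]

A = [[0.9, 0.1], [0.0, 0.8]]      # state transition (made up)
H = [[1.0, 0.0]]                  # we observe only the first state
Q = [[0.1, 0.0], [0.0, 0.1]]
R = [[0.5]]

y = [[0.0], [0.0]]                # y_{0|0}
P = [[1.0, 0.0], [0.0, 1.0]]      # P_{0|0}

for z in [0.8, 1.1, 0.9]:         # made-up measurements
    # Prediction step
    y_pred = mat_mul(A, y)
    P_pred = mat_add(mat_mul(mat_mul(A, P), transpose(A)), Q)
    # Update step (F = H P_pred H' + R is 1x1 here)
    F = mat_add(mat_mul(mat_mul(H, P_pred), transpose(H)), R)
    K = [[row[0] / F[0][0]] for row in mat_mul(P_pred, transpose(H))]
    e = z - mat_mul(H, y_pred)[0][0]          # forecast error e_t|t-1
    y = [[yi[0] + ki[0] * e] for yi, ki in zip(y_pred, K)]
    P = mat_sub(P_pred, mat_mul(mat_mul(K, H), P_pred))

print(round(y[0][0], 3), round(P[0][0], 3))
```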

Kalman Filter

• Recall: et|t-1 = zt - zt|t-1 = zt - H yt|t-1   (forecast error)
Ft|t-1 = H Pt|t-1 H' + R   (variance of et|t-1)

• The initial values, y0|0 & P0|0, are set to unconditional mean and variance, and reflect prior beliefs about the distribution of yt.

• The update of the state variable, yt|t, and its variance, Pt|t, are linear combinations of the previous guess and the forecast error:

yt|t = yt|t-1 + Kt (zt - H yt|t-1) = yt|t-1 + Kt et|t-1
Pt|t = Pt|t-1 - Kt H Pt|t-1   (conditional variance)
Kt = Pt|t-1 H' (H Pt|t-1 H' + R)^-1 = Cov[zt, yt|It-1] (Ft|t-1)^-1

Since we observe zt, the uncertainty (measured by Pt|t) declines.

Note: The bigger Ft|t-1, the smaller Kt and the less weight put on updating.

Kalman Gain

• Kt depends on the relationship between (zt & yt) and Ft|t-1:

Kt = Pt|t-1 H' (H Pt|t-1 H' + R)^-1 = Cov[zt, yt|It-1] (Ft|t-1)^-1

- The stronger Cov[zt, yt|It-1], the more relevant Kt is in the update.
- If the relationship is weaker, we do not put much weight on the residual, as it is likely not driven by yt.
- If we are sure about the measurements, R decreases to zero; Kt increases and we weight residuals more heavily than the prediction.

• If the model is time-invariant (Q and R are, in fact, constant) the

Kalman gain quickly converges to a constant: Kt → K. In this case, the filter becomes stationary.
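This convergence can be sketched for the scalar time-invariant case (A = H = 1; Q and R values are illustrative), iterating the variance recursion until the gain settles at its steady-state value:

```python
# Riccati-style iteration for the steady-state Kalman gain in a scalar,
# time-invariant model: P_pred = P_upd + Q, K = P_pred/(P_pred + R),
# P_upd = (1 - K) P_pred.

Q, R = 0.5, 1.0
p = 10.0                      # start from a large initial variance
gains = []
for _ in range(50):
    p_pred = p + Q            # prediction variance
    k = p_pred / (p_pred + R) # Kalman gain
    p = (1 - k) * p_pred      # updated variance
    gains.append(k)

print(round(gains[-1], 4))    # the gain has converged (to 0.5 here)
```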


MLE

• Once we have et|t-1 & Ft|t-1, under normality, we can do MLE:

ln L = -(Tn/2) ln(2π) - (1/2) Σ_{t=p+1..T} ln|Ft| - (1/2) Σ_{t=p+1..T} e't Ft^-1 et + ln f(zp, zp-1, ..., z1)

Note: Under normality, the KF is optimal in an MSE sense.

• Algorithm (after initialization & ignoring first observations):

for (i in 1:T) {
  zhat = t(H) %*% state10
  zvar = t(H) %*% P10 %*% H + R
  zvarinv = inv(zvar)
  eps = t(z[i,]) - zhat
  f0 = f0 - ln(det(zvar)) - t(eps) %*% zvarinv %*% eps
  state11 = state10 + P10 %*% H %*% zvarinv %*% eps
  P11 = P10 - P10 %*% H %*% zvarinv %*% t(H) %*% P10
  state10 = A %*% state11
  P10 = A %*% P11 %*% t(A) + Q
}
f0 = -(T*n/2) * ln(2*pi) + f0/2
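The same prediction-error decomposition of the likelihood can be sketched in Python for a scalar local level model (A = H = 1, no exogenous input; the data and parameter values are made up). The returned log-likelihood could then be maximized over (q, r):

```python
# Gaussian log-likelihood built from the KF prediction errors e_t|t-1 and
# their variances F_t|t-1 (scalar sketch).
import math

def kalman_loglik(zs, q, r, y0=0.0, p0=1.0):
    """Log-likelihood of measurements zs under a scalar local level model."""
    y, p, ll = y0, p0, 0.0
    for z in zs:
        y_pred, p_pred = y, p + q          # prediction step
        f = p_pred + r                     # F_t|t-1 = H P_pred H' + R
        e = z - y_pred                     # forecast error
        ll += -0.5 * (math.log(2 * math.pi) + math.log(f) + e * e / f)
        k = p_pred / f                     # update step
        y, p = y_pred + k * e, (1 - k) * p_pred
    return ll

zs = [0.5, 1.0, 0.7, 1.3]                  # made-up data
print(round(kalman_loglik(zs, q=0.3, r=1.0), 3))
```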

Derivation of Update

• We wrote the joint distribution of (zt, yt)|It-1. But we could have also written the joint distribution of (et, yt)|It-1:

(yt|It-1)     ( (yt|t-1)   ( Pt|t-1     Pt|t-1 H' ) )
(et|It-1) ~ N ( (0)      , ( H Pt|t-1   Ft|t-1    ) )

• Recall a property of the multivariate normal:

If x1 and x2 are jointly normally distributed, the conditional distribution of x1|x2 is also normal, with mean μ1|2 and variance Σ1|2:

μ1|2 = μ1 + Σ12 Σ22^-1 (x2 - μ2)
Σ1|2 = Σ11 - Σ12 Σ22^-1 Σ21

Then, yt|t = yt|t-1 + Pt|t-1 H' (Ft|t-1)^-1 et|t-1

Pt|t = Pt|t-1 - Pt|t-1 H' (Ft|t-1)^-1 H Pt|t-1


Prediction (Time Update):
(1) Project the state ahead: yt|t-1 = A yt-1|t-1 + B ut|t-1
(2) Project the error covariance ahead: Pt|t-1 = A Pt-1|t-1 A' + Q

Correction (Measurement Update):
(1) Compute the Kalman gain: Kt = Pt|t-1 H' (H Pt|t-1 H' + R)^-1
(2) Update the estimate with measurement zt: yt|t = yt|t-1 + Kt (zt - H ŷt|t-1)
(3) Update the error covariance: Pt|t = (I – Kt H) Pt|t-1


Kalman Filter: Remarks

• The KF works by minimizing E[(yt – yt|t-1)' (yt – yt|t-1)]. Under this metric, the KF is the optimal estimator.

• Conditions for optimality (MSE) – Anderson and Moore (1979):
- the DGP (linear system & its state space model) is exactly known;
- the noise vector (wt, vt) is white;
- the noise covariances, Q and R, are known.

• The KF is the minimum-MSE estimator among all linear estimators, but in the case of a Gaussian model it is the minimum-MSE estimator among all estimators.

• In practice, it is difficult to meet the three conditions. A lot of tuning by ad-hoc methods is needed to get a KF that works "sufficiently well."

Kalman Filter: Practical Considerations

• Estimating Q and R usually involves NW-style SEs. Estimating Q is, in general, complicated. Tuning Q and R is usual practice to improve performance.

• The inversion of Ft|t-1 can be difficult. Usually, the problem comes from having Q singular. In practice, approximations (pseudo-inversion) are used.

• When knowledge/confidence about y0 is low, a diffuse prior for y0 would set P0|0 high.

• The noise vector (wt, vt) should be WN. Autocorrelograms and LB tests can be used to check this.


Kalman Filter: Problems and Variations

• When the model – i.e., g(·) and f(·) – is non-linear, the extended Kalman filter (EKF) works by linearizing the model (similar to NLLS; A & H are Jacobian matrices of partial derivatives).

Problem: The distributions of the RVs are no longer normal after their respective nonlinear transformations. EKF is an ad-hoc method.

• When the model is highly non-linear, the EKF will not work well. The unscented Kalman filter (UKF), which propagates a deterministic set of sample points (the unscented transform) through the model to calculate the updates, works better.

• The KF struggles under non-normality and when the dimensions of the state vector increase. Particle filters (sequential Monte Carlo), another class of Bayesian filters, are very popular in these cases.

Kalman Filter: Smoothing

• We can use all the data to re-estimate our prediction. That is, yt|T.

• Inputs: Initial distribution y0 and data z1, …, zT

• Algorithm: Forward-backward pass (Rauch-Tung-Striebel algorithm)

- Forward pass:

Kalman filter: compute yt+1|t and yt+1|t+1 for 0 ≤ t < T

- Backward pass:

Compute yt|T for 0 ≤ t < T. This reverses the process in our intuitive example.


Kalman Filter: Smoothing – Backward Pass

• Compute yt|T given yt+1|T ~ N(yt+1|T, Pt+1|T )

- Reverse movement from the filter: yt|t → yt+1|t

- Same as incorporating measurement in filter

1. Compute the joint pdf of (yt|t, yt+1|t)
2. Compute the conditional distribution (yt|t | yt+1|t = yt+1)

- But: yt+1 is not "known"; we only know its distribution.

3. "Uncondition" on yt+1 to compute yt|T, using the laws of total expectation and variance


Kalman Filter: Smoothing – Backward Pass

• Step 1. Compute the joint pdf of (yt|t, yt+1|t):

(yt|t  )     ( (yt|t  )   ( Var(yt|t)           Cov(yt|t, yt+1|t) ) )
(yt+1|t) ~ N ( (yt+1|t) , ( Cov(yt+1|t, yt|t)   Var(yt+1|t)       ) )

             ( (yt|t  )   ( Pt|t      Pt|t A'  ) )
           = N ( (yt+1|t) , ( A Pt|t    Pt+1|t   ) )

• Step 2. Compute the conditional pdf of (yt|t | yt+1|t = yt+1):

yt|t | (yt+1|t = yt+1) ~ N( yt|t + Pt|t A' Pt+1|t^-1 (yt+1 - yt+1|t),  Pt|t - Pt|t A' Pt+1|t^-1 A Pt|t )

where we used the conditional normal result:
μ1|2 = μ1 + Σ12 Σ22^-1 (x2 - μ2)
Σ1|2 = Σ11 - Σ12 Σ22^-1 Σ21

Kalman Filter: Smoothing – Backward Pass

• Step 3. "Uncondition" on yt+1 to compute yt|T. We do not know its value, only its distribution: yt+1 ~ N(yt+1|T, Pt+1|T).

• Uncondition on yt+1 using the law of total expectation and the law of total variance:

Law of total expectation: E[X] = E_Z[E(X|Z)]
E[yt|T] = E_{yt+1|T}[E(yt|t | yt+1|t = yt+1|T)]

Law of total variance: Var(X) = E_Z[Var(X|Z)] + Var_Z(E[X|Z])
Var(yt|T) = E_{yt+1|T}[Var(yt|t | yt+1|t = yt+1|T)] + Var_{yt+1|T}(E[yt|t | yt+1|t = yt+1|T])

Kalman Filter: Smoothing – Backward Pass

• Step 3 (continuation). From Step 2 we know (with Lt ≡ Pt|t A' Pt+1|t^-1):

E(yt|t | yt+1|t = yt+1|T) = yt|t + Lt (yt+1|T - yt+1|t)
Var(yt|t | yt+1|t = yt+1|T) = Pt|t - Lt Pt+1|t Lt'

Then,

E(yt|T) = E_{yt+1|T}[E(yt|t | yt+1|t = yt+1|T)]
        = yt|t + Lt (yt+1|T - yt+1|t)

Var(yt|T) = E_{yt+1|T}[Var(yt|t | yt+1|t = yt+1|T)] + Var_{yt+1|T}(E(yt|t | yt+1|t = yt+1|T))
          = Pt|t - Lt Pt+1|t Lt' + Lt Pt+1|T Lt'
          = Pt|t + Lt (Pt+1|T - Pt+1|t) Lt'

Kalman Filter: Smoothing – Backward Pass

• Summary:

Lt = Pt|t A' Pt+1|t^-1
yt|T = yt|t + Lt (yt+1|T - yt+1|t)
Pt|T = Pt|t + Lt (Pt+1|T - Pt+1|t) Lt'

• Algorithm:

for (it in 1:T-1) {
  P0T = reshape(Psmo[T-it+1,], rx, rx)
  state0T = statesmo[T-it+1,]
  Pfilt = reshape(P[T-it,], rx, rx)        # this is P_t given t
  Pnext = A %*% Pfilt %*% t(A) + Q         # this is P_t+1 given t
  Lt = Pfilt %*% t(A) %*% inv(Pnext)
  state0T = t(state[T-it,]) + Lt %*% (state0T - A %*% t(state[T-it,]))
  P0T = Pfilt + Lt %*% (P0T - Pnext) %*% t(Lt)
  statesmo[T-it,] = t(state0T)
  Psmo[T-it,] = t(vec(P0T))
}

Kalman Filter: Smoothing – Algorithm

• for (t = 0; t < T; ++t)   (Kalman filter)

yt+1|t = A yt|t
Pt+1|t = A Pt|t A' + Q
Kt+1 = Pt+1|t H' (H Pt+1|t H' + R)^-1
yt+1|t+1 = yt+1|t + Kt+1 (zt+1 - H yt+1|t)
Pt+1|t+1 = Pt+1|t - Kt+1 H Pt+1|t

• for (t = T – 1; t ≥ 0; --t)   (Backward pass)

Lt = Pt|t A' Pt+1|t^-1
yt|T = yt|t + Lt (yt+1|T - yt+1|t)
Pt|T = Pt|t + Lt (Pt+1|T - Pt+1|t) Lt'

Kalman Filter: Smoothing - Remarks

• Kalman smoother is a post-processing method.

• Use the yt|T's as the optimal estimate of the state at time t, and use Pt|T as a measure of its uncertainty.

• The smoothing recursion consists of the backward recursion that uses the filtered values of y and P.
