
EE378A Statistical Signal Processing    04/18/2017
Lecture 3: Particle Filtering and Introduction to Bayesian Inference
Lecturer: Tsachy Weissman    Scribe: Charles Hale

In this lecture, we introduce methods for performing inference on Hidden Markov Processes beyond the case of discrete alphabets. This first requires learning how to estimate a pdf from samples drawn from a different distribution. At the end, we briefly introduce Bayesian inference.

1 Recap: Inference on Hidden Markov Processes (HMPs)

1.1 Setting

A quick recap on the setting of a HMP:

1. {Xn}n≥1 is a Markov process, called “the state process”.

2. {Yn}n≥1 is "the observation process", where Yi is the output of Xi sent through a "memoryless" channel characterized by the distribution P_{Y|X}.

Alternatively, we showed in HW1 that the above is totally equivalent to the following:

1. {Xn}n≥1 is “the state process” defined by

Xt = ft(Xt−1,Wt)

where {Wn}n≥1 is an independent process, i.e., each Wt is independent of all the others.

2. {Yn}n≥1 is “the observation process” related to {Xn}n≥1 by

Yt = gt(Xt,Nt)

where {Nn}n≥1 is an independent process.

3. The processes {Nn}n≥1 and {Wn}n≥1 are independent of each other.
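The state-space form above is easy to simulate. Below is a minimal sketch; the specific choices f_t(x, w) = 0.9x + w and g_t(x, n) = x + 0.5n (a linear-Gaussian model) are illustrative assumptions, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_hmp(T):
    """Simulate T steps of a toy HMP: X_t = f_t(X_{t-1}, W_t), Y_t = g_t(X_t, N_t).
    Linear-Gaussian f and g are chosen purely for illustration."""
    xs, ys = [], []
    x = rng.normal()                     # X_1 ~ N(0, 1)
    for _ in range(T):
        xs.append(x)
        ys.append(x + 0.5 * rng.normal())   # Y_t = X_t + 0.5 N_t (memoryless channel)
        x = 0.9 * x + rng.normal()          # X_{t+1} = 0.9 X_t + W_{t+1}
    return np.array(xs), np.array(ys)

xs, ys = simulate_hmp(100)
```

Note that {Wn} and {Nn} are drawn independently of each other, matching item 3 above.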

1.2 Goal

Since the state process {Xn}n≥1 is "hidden" from us, we only get to observe {Yn}n≥1 and wish to find a way of estimating {Xn}n≥1 based on {Yn}n≥1. To do this, we define the forward recursions for causal estimation:

α_t(x_t) = F(β_t, P_{Y_t|X_t}, y_t)    (1)

β_{t+1}(x_{t+1}) = G(α_t, P_{X_{t+1}|X_t})    (2)

where F and G are the operators defined in lecture 4.

1.3 Challenges

For general alphabets X, Y, computing the forward recursion is a very difficult problem. However, we know efficient algorithms that can be applied in the following situations:

1. All random variables are discrete with a finite alphabet. We can then solve these equations by brute force in time proportional to the alphabet sizes.

2. For t ≥ 1, ft and gt are linear and Nt, Wt, Xt are Gaussian. The solution is the Kalman filter algorithm that we derive in Homework 2.

Beyond these special situations, there are many heuristic algorithms for this computation:

1. We can compute the forward recursions approximately by quantizing the alphabets. This leads to a tradeoff between model accuracy and computation cost.

2. This lecture will focus on Particle Filtering, which is a way of estimating the forward recursions with an adaptive quantization.
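For the finite-alphabet case in item 1, the brute-force forward recursion is a direct computation. A minimal sketch (the two-state transition and observation matrices are made-up numbers for illustration):

```python
import numpy as np

def forward_recursion(P_trans, P_obs, ys, pi0):
    """Brute-force forward recursion for a finite-alphabet HMP.
    P_trans[i, j] = P(X_{t+1}=j | X_t=i), P_obs[i, k] = P(Y=k | X=i),
    pi0 = distribution of X_1.  Returns the filtering distributions
    alpha_t(x) = P(X_t = x | y_1, ..., y_t) for each t."""
    beta = pi0.copy()                  # beta_t(x) = P(X_t=x | y_1, ..., y_{t-1})
    alphas = []
    for y in ys:
        alpha = beta * P_obs[:, y]     # F: multiply by the likelihood of y_t ...
        alpha /= alpha.sum()           # ... and renormalize
        alphas.append(alpha)
        beta = alpha @ P_trans         # G: propagate through the transition kernel
    return np.array(alphas)

P_trans = np.array([[0.9, 0.1], [0.2, 0.8]])
P_obs = np.array([[0.8, 0.2], [0.3, 0.7]])
alphas = forward_recursion(P_trans, P_obs, [0, 1, 1], np.array([0.5, 0.5]))
```

Each step costs time proportional to the alphabet sizes, as claimed; the difficulty for continuous alphabets is that beta and alpha are then densities, not finite vectors.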

2 Importance Sampling

Before we can learn about Particle Filtering, we first need to discuss importance sampling1.

2.1 Simple Setting

Let X_1, ..., X_N be i.i.d. random variables with density f. We wish to approximate f by a probability mass function f̂_N built from our data. The idea: we can approximate f by taking

f̂_N(x) = (1/N) ∑_{i=1}^{N} δ(x − X_i)    (3)

where δ(x) is a Dirac delta distribution.

Claim: f̂_N is a good estimate of f.

Here we define the "goodness" of our estimate by looking at

E_{X∼f̂_N}[g(X)]    (4)

and seeing how close it is to

E_{X∼f}[g(X)]    (5)

for any function g with E|g(X)| < ∞. We show that by this definition, (3) is a good estimate for f.

Proof:

E_{X∼f̂_N}[g(X)] = ∫_{−∞}^{∞} f̂_N(x) g(x) dx    (6)

= ∫_{−∞}^{∞} (1/N) ∑_{i=1}^{N} δ(x − X_i) g(x) dx    (7)

= (1/N) ∑_{i=1}^{N} ∫_{−∞}^{∞} δ(x − X_i) g(x) dx    (8)

= (1/N) ∑_{i=1}^{N} g(X_i)    (9)

1See section 2.12.4 of the Spring 2013 lecture notes for further reference.

By the strong law of large numbers,

(1/N) ∑_{i=1}^{N} g(X_i) → E_{X∼f}[g(X)]  a.s.    (10)

as N → ∞. Hence,

E_{X∼f̂_N}[g(X)] → E_{X∼f}[g(X)]    (11)

so f̂_N is indeed a "good" estimate of f.
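The computation (6)-(9) says that integrating g against f̂_N collapses to an average of g over the samples. A quick numerical check (the choice f = N(0, 1) and g(x) = x², for which E[g(X)] = 1, is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical approximation of E_{X~f}[g(X)] from i.i.d. samples, as in (9).
N = 200_000
samples = rng.normal(size=N)          # X_i i.i.d. ~ f = N(0, 1)
g = lambda x: x ** 2
estimate = np.mean(g(samples))        # (1/N) sum_i g(X_i)

# For f = N(0, 1) and g(x) = x^2 the true value is E[g(X)] = Var(X) = 1.
```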

2.2 General Setting

Suppose now that X_1, ..., X_N are i.i.d. with common density q, but we wish to estimate a different pdf f.

Theorem: The estimator

f̂_N(x) = ∑_{i=1}^{N} w_i δ(x − X_i)    (12)

with weights

w_i = (f(X_i)/q(X_i)) / ∑_{j=1}^{N} (f(X_j)/q(X_j))    (13)

is a good estimator of f.

Proof: It is easy to see

E_{X∼f̂_N}[g(X)] = ∑_{i=1}^{N} w_i g(X_i)    (14)

= ∑_{i=1}^{N} [ (f(X_i)/q(X_i)) / ∑_{j=1}^{N} (f(X_j)/q(X_j)) ] g(X_i)    (15)

= [ ∑_{i=1}^{N} (f(X_i)/q(X_i)) g(X_i) ] / [ ∑_{j=1}^{N} (f(X_j)/q(X_j)) ]    (16)

= [ (1/N) ∑_{i=1}^{N} (f(X_i)/q(X_i)) g(X_i) ] / [ (1/N) ∑_{j=1}^{N} (f(X_j)/q(X_j)) ].    (17)

As N → ∞, the above expression converges to

E[(f(X)/q(X)) g(X)] / E[f(X)/q(X)] = [ ∫_{−∞}^{∞} (f(x)/q(x)) g(x) q(x) dx ] / [ ∫_{−∞}^{∞} (f(x)/q(x)) q(x) dx ]    (18)

= [ ∫_{−∞}^{∞} f(x) g(x) dx ] / [ ∫_{−∞}^{∞} f(x) dx ]    (19)

= ∫_{−∞}^{∞} f(x) g(x) dx    (20)

= E_{X∼f}[g(X)].    (21)

Hence,

E_{X∼f̂_N}[g(X)] → E_{X∼f}[g(X)]    (22)

as N → ∞.
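The theorem can be checked numerically with self-normalized weights as in (13). The target f = N(0, 1) and proposal q = N(0, 4) below are illustrative assumptions; the true value of the estimated expectation is E_{X∼f}[X²] = 1:

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

N = 200_000
q_samples = rng.normal(loc=0.0, scale=2.0, size=N)     # X_i i.i.d. ~ q = N(0, 4)
ratios = normal_pdf(q_samples, 0.0, 1.0) / normal_pdf(q_samples, 0.0, 2.0)
weights = ratios / ratios.sum()                         # w_i as in (13)

estimate = np.sum(weights * q_samples ** 2)             # E_{f_hat_N}[g(X)] as in (14)
```

A heavier-tailed proposal such as q = N(0, 4) keeps the ratios f/q bounded, which keeps the weights well behaved.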

2.3 Implications

The above theorem is useful in situations when we cannot draw samples X_i ∼ f directly and only know f up to a constant, f(x) = c·h(x) (in many practical cases it is computationally prohibitive to obtain the normalization constant c = (∫ h(x) dx)^{−1}). Then

w_i = (f(X_i)/q(X_i)) / ∑_{j=1}^{N} (f(X_j)/q(X_j))    (23)

= (c·h(X_i)/q(X_i)) / ∑_{j=1}^{N} (c·h(X_j)/q(X_j))    (24)

= (h(X_i)/q(X_i)) / ∑_{j=1}^{N} (h(X_j)/q(X_j))    (25)

Using these w_i's, we see that we can approximate f using samples drawn from a different distribution q together with only the unnormalized density h.
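The cancellation of c in (25) is the practical payoff: the code never needs the normalizing constant. A sketch, where the target f(x) = c·exp(−x²/2) (so f is N(0, 1) with c deliberately unknown to the code) and the uniform proposal are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Weights as in (25): only the unnormalized density h is needed.
h = lambda x: np.exp(-0.5 * x ** 2)            # f(x) = c * h(x), c never computed

N = 200_000
q_samples = rng.uniform(-6.0, 6.0, size=N)     # X_i i.i.d. ~ q = Uniform(-6, 6)
ratios = h(q_samples) / (1.0 / 12.0)           # h(X_i) / q(X_i)
weights = ratios / ratios.sum()

mean_estimate = np.sum(weights * q_samples)    # E_f[X] = 0 for this target
```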

2.4 Application

Let X ∼ f_X and let Y be the observation of X through a memoryless channel characterized by f_{Y|X}. For MAP reconstruction of X, we must compute

f_{X|Y}(x|y) = f_X(x) f_{Y|X}(y|x) / ∫_{−∞}^{∞} f_X(x̃) f_{Y|X}(y|x̃) dx̃    (26)

In general, this computation may be hard due to the integral in the denominator. But the denominator is just a constant. Hence, suppose we have X_i i.i.d. ∼ f_X for 1 ≤ i ≤ N. By the previous theorem, we can take

f̂_{X|y}(x) = ∑_{i=1}^{N} w_i(y) δ(x − X_i)    (27)

as an approximation, where

w_i(y) = (f_{X|y}(X_i)/f_X(X_i)) / ∑_{j=1}^{N} (f_{X|y}(X_j)/f_X(X_j))    (28)

= (f_X(X_i) f_{Y|X}(y|X_i) / f_X(X_i)) / ∑_{j=1}^{N} (f_X(X_j) f_{Y|X}(y|X_j) / f_X(X_j))    (29)

= f_{Y|X}(y|X_i) / ∑_{j=1}^{N} f_{Y|X}(y|X_j)    (30)

Hence, using the samples X1,X2, ··· ,XN we can build a good approximation of fX|Y (·|y) for fixed y using the above procedure.
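As a concrete sketch of this procedure: draw X_i from the prior f_X and weight each by the channel likelihood f_{Y|X}(y|X_i), as in (30). The model below (X ∼ N(0, 1), Y = X + N with N ∼ N(0, 1), observation y = 1) is an illustrative assumption; for it the exact posterior mean is y/2 = 0.5:

```python
import numpy as np

rng = np.random.default_rng(4)

N = 200_000
prior_samples = rng.normal(size=N)                      # X_i i.i.d. ~ f_X = N(0, 1)
y = 1.0                                                 # fixed observation

likelihood = np.exp(-0.5 * (y - prior_samples) ** 2)    # proportional to f_{Y|X}(y|X_i)
weights = likelihood / likelihood.sum()                 # w_i(y) as in (30)

posterior_mean = np.sum(weights * prior_samples)        # approximates E[X | Y = y] = 0.5
```

Note that the intractable denominator of (26) is never computed; it cancels in the self-normalized weights.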

Let us denote this approximation as

f̂_N(X_1, X_2, ..., X_N, f_X, f_{Y|X}, y) = ∑_{i=1}^{N} w_i δ(x − X_i)    (31)

3 Particle Filtering

Particle Filtering is a way to approximate the forward recursions of a HMP, using importance sampling to approximate the generated distributions rather than computing the true solutions.

    particles ← generate {X_1^(i)}_{i=1}^N i.i.d. ∼ f_{X_1};
    for t ≥ 1 do
        α̂_t = F̂_N({X_t^(i)}_{i=1}^N, β̂_t, P_{Y_t|X_t}, y_t);   // estimate α_t using importance sampling as in (31)
        nextParticles ← generate {X̃_t^(i)}_{i=1}^N i.i.d. ∼ α̂_t;
        reset particles;
        for 1 ≤ i ≤ N do
            particles ← generate X_{t+1}^(i) ∼ f_{X_{t+1}|X_t}(·|X̃_t^(i));
        end
        β̂_{t+1} = (1/N) ∑_{i=1}^N δ(x − X_{t+1}^(i));
    end

Algorithm 1: Particle Filter

This algorithm requires some art in use. It is not well understood how errors propagate through these successive approximations. In practice, the parameters need to be periodically reset, and choosing these hyperparameters is an application-specific task.
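The steps of Algorithm 1 can be sketched concretely. Everything model-specific below is an illustrative assumption: a linear-Gaussian state process X_{t+1} = 0.9 X_t + W_t observed as Y_t = X_t + N_t with standard-normal noises. The filter itself only performs the three steps of the algorithm: likelihood weighting, resampling, and propagation:

```python
import numpy as np

rng = np.random.default_rng(5)

def particle_filter(ys, N=2000):
    """Bootstrap particle filter for the toy model
    X_1 ~ N(0,1), X_{t+1} = 0.9 X_t + W_t, Y_t = X_t + N_t (noises N(0,1)).
    Returns the estimated filtered means E[X_t | y_1, ..., y_t]."""
    particles = rng.normal(size=N)                 # X_1^(i) i.i.d. ~ f_{X_1}
    means = []
    for y in ys:
        # alpha_hat_t: weight particles by the likelihood, as in (31)
        w = np.exp(-0.5 * (y - particles) ** 2)
        w /= w.sum()
        means.append(np.sum(w * particles))
        # X_tilde_t^(i) i.i.d. ~ alpha_hat_t (resampling step)
        resampled = rng.choice(particles, size=N, p=w)
        # propagate through the state dynamics to form beta_hat_{t+1}
        particles = 0.9 * resampled + rng.normal(size=N)
    return np.array(means)

# Generate synthetic data from the same model, then filter it.
T = 50
xs, ys = np.empty(T), np.empty(T)
x = rng.normal()
for t in range(T):
    xs[t] = x
    ys[t] = x + rng.normal()
    x = 0.9 * x + rng.normal()

means = particle_filter(ys)
```

Since this toy model is linear-Gaussian, the Kalman filter of Section 1.3 gives the exact answer here; the particle filter's value is that the same three steps work for nonlinear, non-Gaussian f_t and g_t.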

4 Inference under logarithmic loss

Recall Bayesian decision theory, and let us have the following:

• X ∈ X is something we want to infer;

• X ∼ p_X (assume that X is discrete);

• x̂ ∈ X̂ is something we wish to reconstruct;

• Λ : X × X̂ → R is our loss function.

First we assume that we do not have any observation, and define the Bayesian response as

U(p_X) = min_{x̂} E Λ(X, x̂) = min_{x̂} ∑_x p_X(x) Λ(x, x̂)    (32)

and

X̂_Bayes(p_X) = arg min_{x̂} E Λ(X, x̂).    (33)

Example: Let X = X̂ = R and Λ(x, x̂) = (x − x̂)². Then

U(p_X) = min_{x̂} E_{X∼p_X}(X − x̂)² = Var(X)    (34)

X̂_Bayes(p_X) = E[X].    (35)
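The example can be verified by brute-force search over candidate reconstructions, directly minimizing (32). The small pmf on {0, 1, 2} is an illustrative assumption:

```python
import numpy as np

# Bayes response under squared loss, as in (34)-(35): the minimizer of
# E(X - x_hat)^2 is E[X], and the minimum value is Var(X).
support = np.array([0.0, 1.0, 2.0])
p_X = np.array([0.2, 0.5, 0.3])                # an illustrative pmf

candidates = np.linspace(-1.0, 3.0, 4001)      # grid search over x_hat
risks = np.array([np.sum(p_X * (support - xh) ** 2) for xh in candidates])

x_bayes = candidates[np.argmin(risks)]          # should be E[X] = 1.1
mean_X = np.sum(p_X * support)
var_X = np.sum(p_X * (support - mean_X) ** 2)   # should equal min risk
```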

Consider now the case where we have some observation (side information) Y, so that our reconstruction X̂ can be a function of Y. Then

E Λ(X, X̂(Y)) = E[E[Λ(X, X̂(y)) | Y = y]]    (36)

= ∑_y E[Λ(X, X̂(y)) | Y = y] p_Y(y)    (37)

= ∑_y p_Y(y) ∑_x p_{X|Y=y}(x) Λ(x, X̂(y))    (38)

This implies

min_{X̂(·)} E Λ(X, X̂(Y))    (39)

= min_{X̂(·)} ∑_y p_Y(y) ∑_x p_{X|Y=y}(x) Λ(x, X̂(y))    (40)

= ∑_y p_Y(y) U(p_{X|Y=y})    (41)

and thus

X̂_opt(y) = X̂_Bayes(p_{X|Y=y}).    (42)
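Equation (42) says the optimal estimator just applies the Bayes response to the posterior for each y; under squared loss this is the conditional mean E[X | Y = y]. A sketch on an illustrative joint pmf (the numbers are assumptions, not from the notes):

```python
import numpy as np

# Optimal estimator with side information, as in (42), under squared loss.
p_XY = np.array([[0.3, 0.1],    # rows: x = 0, 1;  columns: y = 0, 1
                 [0.2, 0.4]])

x_vals = np.array([0.0, 1.0])
p_Y = p_XY.sum(axis=0)                      # marginal of Y

def x_opt(y):
    posterior = p_XY[:, y] / p_Y[y]         # p_{X|Y=y}
    return np.sum(posterior * x_vals)       # Bayes response = E[X | Y=y]

# x_opt(0) = 0.2/0.5 = 0.4 and x_opt(1) = 0.4/0.5 = 0.8
```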
