EE378A Statistical Signal Processing                                  Lecture 3 - 04/18/2017

Lecture 3: Particle Filtering and Introduction to Bayesian Inference

Lecturer: Tsachy Weissman                                             Scribe: Charles Hale

In this lecture, we introduce methods for inference on hidden Markov processes in cases beyond discrete alphabets. This first requires learning how to estimate a pdf from samples drawn from a different distribution. At the end, we briefly introduce Bayesian inference.

1 Recap: Inference on Hidden Markov Processes (HMPs)

1.1 Setting

A quick recap of the setting of an HMP:

1. $\{X_n\}_{n \ge 1}$ is a Markov process, called "the state process".
2. $\{Y_n\}_{n \ge 1}$ is "the observation process", where $Y_i$ is the output of $X_i$ sent through a memoryless channel characterized by the distribution $P_{Y|X}$.

Alternatively, we showed in HW1 that the above is totally equivalent to the following:

1. $\{X_n\}_{n \ge 1}$ is "the state process", defined by $X_t = f_t(X_{t-1}, W_t)$, where $\{W_n\}_{n \ge 1}$ is an independent process, i.e., each $W_t$ is independent of all the others.
2. $\{Y_n\}_{n \ge 1}$ is "the observation process", related to $\{X_n\}_{n \ge 1}$ by $Y_t = g_t(X_t, N_t)$, where $\{N_n\}_{n \ge 1}$ is an independent process.
3. The processes $\{N_n\}_{n \ge 1}$ and $\{W_n\}_{n \ge 1}$ are independent of each other.

1.2 Goal

Since the state process $\{X_n\}_{n \ge 1}$ is "hidden" from us, we only get to observe $\{Y_n\}_{n \ge 1}$ and wish to find a way of estimating $\{X_n\}_{n \ge 1}$ based on $\{Y_n\}_{n \ge 1}$. To do this, we define the forward recursions for causal estimation:

$$\alpha_t(x_t) = F(\beta_t, P_{Y_t|X_t}, y_t) \tag{1}$$
$$\beta_{t+1}(x_{t+1}) = G(\alpha_t, P_{X_{t+1}|X_t}) \tag{2}$$

where $F$ and $G$ are the operators defined in lecture 4.

1.3 Challenges

For general alphabets $\mathcal{X}, \mathcal{Y}$, computing the forward recursion is a very difficult problem. However, we know efficient algorithms that can be applied in the following situations:

1. All random variables are discrete with a finite alphabet. We can then solve these equations by brute force in time proportional to the alphabet sizes.
2. For $t \ge 1$, $f_t$ and $g_t$ are linear and $N_t, W_t, X_t$ are Gaussian. The solution is the "Kalman filter" algorithm that we derive in Homework 2.

Beyond these special situations, there are many heuristic algorithms for this computation:

1. We can compute the forward recursions approximately by quantizing the alphabets. This leads to a tradeoff between model accuracy and computation cost.
2. This lecture will focus on particle filtering, which is a way of estimating the forward recursions with an adaptive quantization.

2 Importance Sampling

Before we can learn about particle filtering, we first need to discuss importance sampling. (See Section 2.12.4 of the Spring 2013 lecture notes for further reference.)

2.1 Simple Setting

Let $X_1, \dots, X_N$ be i.i.d. random variables with density $f$. We wish to approximate $f$ by a probability mass function $\hat{f}_N$ using our data. The idea: we can approximate $f$ by taking

$$\hat{f}_N = \frac{1}{N} \sum_{i=1}^{N} \delta(x - X_i) \tag{3}$$

where $\delta(x)$ is a Dirac delta distribution.

Claim: $\hat{f}_N$ is a good estimate of $f$. Here we define the "goodness" of our estimate by looking at

$$\mathbb{E}_{X \sim \hat{f}_N}[g(X)] \tag{4}$$

and seeing how close it is to

$$\mathbb{E}_{X \sim f}[g(X)] \tag{5}$$

for any function $g$ with $\mathbb{E}|g(X)| < \infty$. We show that by this definition, (3) is a good estimate of $f$.

Proof:

$$\mathbb{E}_{X \sim \hat{f}_N}[g(X)] = \int_{-\infty}^{\infty} \hat{f}_N(x) g(x)\, dx \tag{6}$$
$$= \int_{-\infty}^{\infty} \frac{1}{N} \sum_{i=1}^{N} \delta(x - X_i) g(x)\, dx \tag{7}$$
$$= \frac{1}{N} \sum_{i=1}^{N} \int_{-\infty}^{\infty} \delta(x - X_i) g(x)\, dx \tag{8}$$
$$= \frac{1}{N} \sum_{i=1}^{N} g(X_i). \tag{9}$$

By the law of large numbers,

$$\frac{1}{N} \sum_{i=1}^{N} g(X_i) \xrightarrow{a.s.} \mathbb{E}_{X \sim f}[g(X)] \tag{10}$$

as $N \to \infty$. Hence,

$$\mathbb{E}_{X \sim \hat{f}_N}[g(X)] \to \mathbb{E}_{X \sim f}[g(X)], \tag{11}$$

so $\hat{f}_N$ is indeed a "good" estimate of $f$.

2.2 General Setting

Suppose now that $X_1, \dots, X_N$ are i.i.d. with common density $q$, but we wish to estimate a different pdf $f$.

Theorem: The estimator

$$\hat{f}_N(x) = \sum_{i=1}^{N} w_i \delta(x - X_i) \tag{12}$$

with weights

$$w_i = \frac{f(X_i)/q(X_i)}{\sum_{j=1}^{N} f(X_j)/q(X_j)} \tag{13}$$

is a good estimator of $f$.

Proof: It is easy to see that

$$\mathbb{E}_{X \sim \hat{f}_N}[g(X)] = \sum_{i=1}^{N} w_i g(X_i) \tag{14}$$
$$= \sum_{i=1}^{N} \frac{f(X_i)/q(X_i)}{\sum_{j=1}^{N} f(X_j)/q(X_j)}\, g(X_i) \tag{15}$$
$$= \frac{\sum_{i=1}^{N} \frac{f(X_i)}{q(X_i)} g(X_i)}{\sum_{j=1}^{N} \frac{f(X_j)}{q(X_j)}} \tag{16}$$
$$= \frac{\frac{1}{N}\sum_{i=1}^{N} \frac{f(X_i)}{q(X_i)} g(X_i)}{\frac{1}{N}\sum_{j=1}^{N} \frac{f(X_j)}{q(X_j)}}. \tag{17}$$

As $N \to \infty$, by the law of large numbers applied separately to the numerator and the denominator, the above expression converges to

$$\frac{\mathbb{E}\left[\frac{f(X)}{q(X)} g(X)\right]}{\mathbb{E}\left[\frac{f(X)}{q(X)}\right]} = \frac{\int_{-\infty}^{\infty} \frac{f(x)}{q(x)} g(x)\, q(x)\, dx}{\int_{-\infty}^{\infty} \frac{f(x)}{q(x)}\, q(x)\, dx} \tag{18}$$
$$= \frac{\int_{-\infty}^{\infty} f(x) g(x)\, dx}{\int_{-\infty}^{\infty} f(x)\, dx} \tag{19}$$
$$= \int_{-\infty}^{\infty} f(x) g(x)\, dx \tag{20}$$
$$= \mathbb{E}_{X \sim f}[g(X)]. \tag{21}$$

Hence,

$$\mathbb{E}_{X \sim \hat{f}_N}[g(X)] \to \mathbb{E}_{X \sim f}[g(X)] \tag{22}$$

as $N \to \infty$.

2.3 Implications

The above theorem is useful in situations where we do not know $f$ and cannot draw samples $X_i \sim f$, but we do know that $f(x) = c \cdot h(x)$ for some constant $c$ (in many practical cases it is computationally prohibitive to obtain the normalization constant $c = (\int h(x)\, dx)^{-1}$). Then

$$w_i = \frac{f(X_i)/q(X_i)}{\sum_{j=1}^{N} f(X_j)/q(X_j)} \tag{23}$$
$$= \frac{c \cdot h(X_i)/q(X_i)}{\sum_{j=1}^{N} c \cdot h(X_j)/q(X_j)} \tag{24}$$
$$= \frac{h(X_i)/q(X_i)}{\sum_{j=1}^{N} h(X_j)/q(X_j)}. \tag{25}$$

Using these $w_i$'s, we see that we can approximate $f$ using samples drawn from a different distribution, together with $h$.
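As a quick sanity check on the theorem and on (25), here is a minimal numerical sketch (not part of the original notes; the proposal $q$, the unnormalized target $h$, and the test function $g$ are chosen purely for illustration). It draws samples from a standard normal proposal and uses self-normalized weights built from $h$ alone to estimate $\mathbb{E}_{X \sim f}[g(X)]$ for an unnormalized Gaussian target:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Proposal q: standard normal. Target f is known only up to a constant,
# f(x) = c * h(x) with h(x) = exp(-(x - 2)^2 / 2), i.e. f is the N(2, 1) density.
def q_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def h(x):
    return np.exp(-(x - 2)**2 / 2)

X = rng.standard_normal(N)        # X_1, ..., X_N i.i.d. ~ q
w = h(X) / q_pdf(X)               # unnormalized weights h(X_i)/q(X_i), as in (25)
w /= w.sum()                      # self-normalize

g = lambda x: x                   # test function g(x) = x
print("importance sampling estimate of E_f[g(X)]:", np.sum(w * g(X)))
print("exact value (f is N(2,1)):", 2.0)
```

Note that the normalization constant of $f$ never enters: only the ratios $h(X_i)/q(X_i)$ are needed, exactly as in (25).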
2.4 Application

Let $X \sim f_X$ and let $Y$ be the observation of $X$ through a memoryless channel characterized by $f_{Y|X}$. For MAP reconstruction of $X$, we must compute

$$f_{X|Y}(x|y) = \frac{f_X(x) f_{Y|X}(y|x)}{\int_{-\infty}^{\infty} f_X(\tilde{x}) f_{Y|X}(y|\tilde{x})\, d\tilde{x}}. \tag{26}$$

In general, this computation may be hard due to the integral appearing in the denominator. But the denominator is just a constant. Hence, suppose we have $X_i$ i.i.d. $\sim f_X$ for $1 \le i \le N$. By the previous theorem, we can take

$$\hat{f}_{X|y} = \sum_{i=1}^{N} w_i(y) \delta(x - X_i) \tag{27}$$

as an approximation, where, using the observation of Section 2.3 with $h(x) = f_X(x) f_{Y|X}(y|x)$,

$$w_i(y) = \frac{f_{X|y}(X_i)/f_X(X_i)}{\sum_{j=1}^{N} f_{X|y}(X_j)/f_X(X_j)} \tag{28}$$
$$= \frac{\frac{f_X(X_i) f_{Y|X}(y|X_i)}{f_X(X_i)}}{\sum_{j=1}^{N} \frac{f_X(X_j) f_{Y|X}(y|X_j)}{f_X(X_j)}} \tag{29}$$
$$= \frac{f_{Y|X}(y|X_i)}{\sum_{j=1}^{N} f_{Y|X}(y|X_j)}. \tag{30}$$

Hence, using the samples $X_1, X_2, \dots, X_N$ we can build a good approximation of $f_{X|Y}(\cdot|y)$ for fixed $y$ using the above procedure. Let us denote this approximation as

$$\hat{f}_N(X_1, X_2, \dots, X_N; f_X, f_{Y|X}; y) = \sum_{i=1}^{N} w_i(y) \delta(x - X_i). \tag{31}$$

3 Particle Filtering

Particle filtering is a way to approximate the forward recursions of an HMP, using importance sampling to approximate the generated distributions rather than computing the true solutions.

Algorithm 1: Particle Filter

    particles: generate $\{X_1^{(i)}\}_{i=1}^{N}$ i.i.d. $\sim f_{X_1}$;
    for $t \ge 1$ do
        $\hat{\alpha}_t = \hat{F}_N(\{X_t^{(i)}\}_{i=1}^{N};\ \hat{\beta}_t, P_{Y_t|X_t}, y_t)$;    // estimate $\alpha_t$ using importance sampling as in (31)
        next particles: generate $\{\tilde{X}_t^{(i)}\}_{i=1}^{N}$ i.i.d. $\sim \hat{\alpha}_t$;
        reset particles;
        for $1 \le i \le N$ do
            particles: generate $X_{t+1}^{(i)} \sim f_{X_{t+1}|X_t}(\cdot \mid \tilde{X}_t^{(i)})$;
        end
        $\hat{\beta}_{t+1} = \frac{1}{N} \sum_{i=1}^{N} \delta(x - X_{t+1}^{(i)})$;
    end

This algorithm requires some art in use. It is not well understood how errors propagate through these successive approximations. In practice, the parameters need to be periodically reset, and choosing these hyperparameters is an application-specific task.
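To make Algorithm 1 concrete, here is a minimal sketch of a particle filter on a toy scalar model (the model, noise levels, horizon, and particle count below are illustrative assumptions, not part of the notes). Each iteration weights the current particles by the channel likelihood, which is the importance-sampling step of (30)-(31), resamples from $\hat{\alpha}_t$, and propagates through the state dynamics to obtain $\hat{\beta}_{t+1}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy scalar model (illustrative only):
#   X_t = 0.9 * X_{t-1} + W_t,  W_t ~ N(0, 1)       (state process)
#   Y_t = X_t + N_t,            N_t ~ N(0, 0.5^2)   (memoryless observation channel)
a, sigma_w, sigma_n = 0.9, 1.0, 0.5
T, N = 50, 2000                                   # horizon and number of particles

# Simulate one state/observation trajectory.
x = np.zeros(T)
y = np.zeros(T)
x[0] = rng.normal(0.0, sigma_w)
y[0] = x[0] + rng.normal(0.0, sigma_n)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(0.0, sigma_w)
    y[t] = x[t] + rng.normal(0.0, sigma_n)

def likelihood(y_t, particles):
    """f_{Y_t|X_t}(y_t | particle) for the Gaussian channel above (up to a constant)."""
    return np.exp(-(y_t - particles) ** 2 / (2 * sigma_n ** 2))

particles = rng.normal(0.0, sigma_w, size=N)      # {X_1^(i)} i.i.d. ~ f_{X_1}
estimates = np.zeros(T)                           # causal estimates of X_t under alpha_hat_t
for t in range(T):
    # alpha_hat_t: importance weights as in (30)-(31); only the channel is needed.
    w = likelihood(y[t], particles)
    w /= w.sum()
    estimates[t] = np.sum(w * particles)
    # Resample: {X_tilde_t^(i)} i.i.d. ~ alpha_hat_t.
    resampled = rng.choice(particles, size=N, p=w)
    # Propagate: X_{t+1}^(i) ~ f_{X_{t+1}|X_t}(. | X_tilde_t^(i)); these represent beta_hat_{t+1}.
    particles = a * resampled + rng.normal(0.0, sigma_w, size=N)

print("RMS error of the causal estimates:", np.sqrt(np.mean((estimates - x) ** 2)))
```

In this linear-Gaussian toy case the Kalman filter of Section 1.3 computes the exact answer, which makes it a convenient benchmark; the particle filter becomes genuinely useful once $f_t$ or $g_t$ is nonlinear or the noise is non-Gaussian.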
4 Inference under logarithmic loss

Recall Bayesian decision theory, and suppose we have the following:

• $X \in \mathcal{X}$ is something we want to infer;
• $X \sim p_X$ (assume that $X$ is discrete);
• $\hat{x} \in \hat{\mathcal{X}}$ is something we wish to reconstruct;
• $\Lambda: \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}$ is our loss function.

First we assume that we do not have any observation, and define the Bayesian response as

$$U(p_X) = \min_{\hat{x}} \mathbb{E}\,\Lambda(X, \hat{x}) = \min_{\hat{x}} \sum_{x} p_X(x) \Lambda(x, \hat{x}) \tag{32}$$

and

$$\hat{X}_{\mathrm{Bayes}}(p_X) = \arg\min_{\hat{x}} \mathbb{E}\,\Lambda(X, \hat{x}). \tag{33}$$

Example: Let $\mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}$ and $\Lambda(x, \hat{x}) = (x - \hat{x})^2$. Then

$$U(p_X) = \min_{\hat{x}} \mathbb{E}_{X \sim p_X}(X - \hat{x})^2 = \mathrm{Var}(X) \tag{34}$$
$$\hat{X}_{\mathrm{Bayes}}(p_X) = \mathbb{E}[X]. \tag{35}$$

Consider now the case where we have some observation (side information) $Y$, so that our reconstruction $\hat{X}$ can be a function of $Y$. Then

$$\mathbb{E}\,\Lambda(X, \hat{X}(Y)) = \mathbb{E}\big[\mathbb{E}[\Lambda(X, \hat{X}(y)) \mid Y = y]\big] \tag{36}$$
$$= \sum_{y} \mathbb{E}[\Lambda(X, \hat{X}(y)) \mid Y = y]\, p_Y(y) \tag{37}$$
$$= \sum_{y} p_Y(y) \sum_{x} p_{X|Y=y}(x) \Lambda(x, \hat{X}(y)). \tag{38}$$

This implies

$$\min_{\hat{X}(\cdot)} \mathbb{E}\,\Lambda(X, \hat{X}(Y)) \tag{39}$$
$$= \min_{\hat{X}(\cdot)} \sum_{y} p_Y(y) \sum_{x} p_{X|Y=y}(x) \Lambda(x, \hat{X}(y)) \tag{40}$$
$$= \sum_{y} p_Y(y)\, U(p_{X|Y=y}), \tag{41}$$

since the inner minimization can be carried out separately for each value of $y$, and thus

$$\hat{X}_{\mathrm{opt}}(y) = \hat{X}_{\mathrm{Bayes}}(p_{X|Y=y}). \tag{42}$$
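As a small numerical illustration of (32)-(33) and (41)-(42) under squared loss (the alphabet, prior, and channel below are made up for illustration and are not part of the notes), the following sketch brute-forces the Bayes response over a grid of candidate reconstructions and compares it with the closed forms (34)-(35):

```python
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])                 # alphabet of X (illustrative)
p_X = np.array([0.5, 0.3, 0.2])                    # prior p_X
P_Y_given_X = np.array([[0.8, 0.2],                # rows indexed by x, columns by y
                        [0.5, 0.5],
                        [0.1, 0.9]])

def bayes_response(p, candidates=np.linspace(0.0, 2.0, 201)):
    """Return U(p) and X_Bayes(p) under squared loss, by brute force over candidate xhat."""
    risks = np.array([np.sum(p * (x_vals - xh) ** 2) for xh in candidates])
    return risks.min(), candidates[risks.argmin()]

# No observation: U(p_X) = Var(X) and X_Bayes(p_X) = E[X], matching (34)-(35).
U, xhat = bayes_response(p_X)
mean = np.sum(p_X * x_vals)
var = np.sum(p_X * x_vals ** 2) - mean ** 2
print(U, var)        # both ~ 0.61
print(xhat, mean)    # both ~ 0.70

# With side information Y: apply the Bayes response to the posterior p_{X|Y=y}, as in (42).
p_Y = p_X @ P_Y_given_X
for y in range(P_Y_given_X.shape[1]):
    p_X_given_y = p_X * P_Y_given_X[:, y] / p_Y[y]
    U_y, xhat_y = bayes_response(p_X_given_y)
    print(f"y={y}:  U(p_X|Y=y) = {U_y:.3f},  Xhat_opt(y) = {xhat_y:.3f}")
```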