Robert Collins Penn State
Sampling Methods: Particle Filtering
CSE586 Computer Vision II, CSE Dept, Penn State Univ

Recall: Importance Sampling
Procedure to estimate $E_P[f(x)]$:
1) Generate N samples $x_i$ from Q(x)
2) Form importance weights
$$w_i = \frac{P(x_i)}{Q(x_i)}$$
3) Compute the empirical estimate of $E_P[f(x)]$, the expected value of f(x) under distribution P(x), as
$$E_P[f(x)] \approx \frac{\sum_{i=1}^{N} w_i \, f(x_i)}{\sum_{i=1}^{N} w_i}$$

Resampling
Note: We thus have a set of weighted samples $(x_i, w_i \mid i = 1, \ldots, N)$.

If we really need random samples from P, we can generate them by resampling such that the likelihood of choosing value $x_i$ is proportional to its weight $w_i$.

This involves sampling from a discrete distribution of N possible values (the N values of $x_i$).

Therefore, regardless of the dimensionality of vector x, we are resampling from a 1D distribution (we are essentially sampling from the indices 1...N, in proportion to the importance weights $w_i$). So we can use the inverse transform sampling method we discussed earlier.

Sequential Monte Carlo Methods
Sequential Importance Sampling (SIS) and the closely related algorithm Sampling Importance Resampling (SIR) are known by various names in the literature:
- bootstrap filtering
- particle filtering
- Condensation algorithm
- survival of the fittest

General idea: Importance sampling on time series data, with samples and weights updated as each new data term is observed. Well-suited for simulating recursive Bayes filtering!

Recall: Bayes Filtering

Two-step Iteration at Each Time t:

Motion Prediction Step:
$$P(x_t \mid z_{1:t-1}) = \int P(x_t \mid x_{t-1}) \, P(x_{t-1} \mid z_{1:t-1}) \, dx_{t-1}$$

Data Correction Step (Bayes rule):
$$P(x_t \mid z_{1:t}) = \frac{P(z_t \mid x_t) \, P(x_t \mid z_{1:t-1})}{\int P(z_t \mid x_t) \, P(x_t \mid z_{1:t-1}) \, dx_t}$$

Problem: in general we get intractable integrals.

Sequential Monte Carlo Methods
Intuition:
• Represent probability distributions by samples (called particles).
• Each particle is a “guess” at the true state.
• For each one, simulate its motion update and add noise to get a motion prediction. Measure the likelihood of this prediction, and weight the resulting particles in proportion to their likelihoods.

Back to Bayes Filtering
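Data Correction Step (Bayes rule):
$$P(x_t \mid z_{1:t}) = \frac{P(z_t \mid x_t) \, P(x_t \mid z_{1:t-1})}{\int P(z_t \mid x_t) \, P(x_t \mid z_{1:t-1}) \, dx_t}$$

The integral in the denominator of Bayes rule disappears as a consequence of representing distributions by a weighted set of samples. Since we have only a finite number of samples, the normalization constant will be the sum of the weights!

Now let’s write the Bayes filter by combining motion prediction and data correction steps into one equation.

$$\underbrace{P(x_t \mid z_{1:t})}_{\text{new posterior}} = c \; \underbrace{P(z_t \mid x_t)}_{\text{data term}} \int \underbrace{P(x_t \mid x_{t-1})}_{\text{motion term}} \; \underbrace{P(x_{t-1} \mid z_{1:t-1})}_{\text{old posterior}} \, dx_{t-1}$$

where c is the normalization constant.

Monte Carlo Bayes Filtering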
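Assume the posterior at time t-1 (which is the prior at time t) has been approximated as a set of N weighted particles:
$$\{ (x_{t-1}^i, w_{t-1}^i) \mid i = 1, \ldots, N \}$$
so that
$$P(x_{t-1} \mid z_{1:t-1}) \approx \sum_{i=1}^{N} w_{t-1}^i \, \delta(x_{t-1} - x_{t-1}^i)$$
where $\delta(\cdot)$ is the Dirac delta function.

Useful property:
$$\int f(x) \, \delta(x - a) \, dx = f(a)$$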
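Then the motion prediction integral simplifies to a summation:
$$\begin{aligned}
P(x_t \mid z_{1:t-1}) &= \int P(x_t \mid x_{t-1}) \, P(x_{t-1} \mid z_{1:t-1}) \, dx_{t-1} && \text{(motion prediction integral)} \\
&\approx \int P(x_t \mid x_{t-1}) \sum_{i=1}^{N} w_{t-1}^i \, \delta(x_{t-1} - x_{t-1}^i) \, dx_{t-1} && \text{(the prior has been approximated by N particles)} \\
&= \sum_{i=1}^{N} w_{t-1}^i \int P(x_t \mid x_{t-1}) \, \delta(x_{t-1} - x_{t-1}^i) \, dx_{t-1} && \text{(exchange order of summation and integration)} \\
&= \sum_{i=1}^{N} w_{t-1}^i \, P(x_t \mid x_{t-1}^i) && \text{(property of Dirac delta function)}
\end{aligned}$$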
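The summation form of the motion prediction can be evaluated directly once a motion model is chosen. The following is a minimal sketch, assuming a hypothetical 1D Gaussian random-walk motion model $P(x_t \mid x_{t-1}) = N(x_t; x_{t-1}, \sigma^2)$ and three made-up particles; the names `gaussian_pdf` and `predicted_density` are illustrative and not from the slides.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a 1D Gaussian N(mean, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def predicted_density(x_t, particles, weights, sigma):
    """Motion prediction as a weighted sum of transition densities.

    Assumes the (hypothetical) motion model P(x_t | x_prev) = N(x_t; x_prev, sigma^2).
    """
    return sum(w * gaussian_pdf(x_t, x_prev, sigma)
               for x_prev, w in zip(particles, weights))


# Made-up weighted particles approximating the posterior at time t-1.
particles = [-1.0, 0.0, 2.0]
weights = [0.2, 0.5, 0.3]  # already normalized to sum to one
sigma = 0.5

# Sanity check: the predicted density should integrate to about one.
step = 0.01
grid = [-6.0 + step * k for k in range(1401)]  # covers [-6, 8]
total_mass = sum(predicted_density(x, particles, weights, sigma) for x in grid) * step
print(round(total_mass, 3))
```

With a Gaussian motion model the prediction is simply a Gaussian mixture with one component per particle, which is why no integral has to be evaluated.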
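Our Bayes filtering equation thus simplifies as well:
$$\begin{aligned}
P(x_t \mid z_{1:t}) &= c \, P(z_t \mid x_t) \int P(x_t \mid x_{t-1}) \, P(x_{t-1} \mid z_{1:t-1}) \, dx_{t-1} \\
&\approx c \, P(z_t \mid x_t) \sum_{i=1}^{N} w_{t-1}^i \, P(x_t \mid x_{t-1}^i) && \text{(plugging in result from previous page)} \\
&= c \sum_{i=1}^{N} w_{t-1}^i \, P(z_t \mid x_t) \, P(x_t \mid x_{t-1}^i) && \text{(bringing term that doesn't depend on i into the summation)}
\end{aligned}$$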
Our new posterior is therefore
$$P(x_t \mid z_{1:t}) \approx c \sum_{i=1}^{N} w_{t-1}^i \, P(z_t \mid x_t) \, P(x_t \mid x_{t-1}^i)$$
but this is still not amenable to computation in closed form for arbitrary motion models and likelihood functions (e.g. we would have to integrate it to compute the normalization constant c).

Idea: Let’s approximate the posterior as a set of N samples!

Idea 2: Hey wait a minute, the prior was already represented as a set of N samples! Why don’t we just “update” each of those?
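Approach: for each sample $x_{t-1}^i$, generate a new sample $x_t^i$ from $P(z_t \mid x_t) \, P(x_t \mid x_{t-1}^i)$ by importance sampling using some convenient proposal distribution $q(x_t \mid x_{t-1}^i, z_t)$.

So, generate a sample
$$x_t^i \sim q(x_t \mid x_{t-1}^i, z_t)$$
and compute its importance weight
$$w_t^i = w_{t-1}^i \, \frac{P(z_t \mid x_t^i) \, P(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, z_t)}$$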
We then can approximate our posterior as
$$P(x_t \mid z_{1:t}) \approx \sum_{i=1}^{N} w_t^i \, \delta(x_t - x_t^i)$$
where the weights $w_t^i$ are normalized so that they sum to one.

SIS Algorithm

[Figure: SIS algorithm listing]

SIS Degeneracy

Unfortunately, pure SIS suffers from degeneracy. In many cases, after a few iterations, all but one particle will have negligible weight.
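Degeneracy is easy to observe numerically by tracking the largest normalized weight over time. The sketch below is illustrative only: it assumes a hypothetical 1D random-walk state with Gaussian process and observation noise, and it assumes the motion prior is used as the proposal, so each weight is simply multiplied by the likelihood at every step. There is deliberately no resampling. The function name `sis_max_weights` and all parameter values are made up for this example.

```python
import math
import random


def sis_max_weights(num_particles=100, num_steps=20,
                    process_sigma=1.0, obs_sigma=0.5, seed=0):
    """Run SIS without resampling; return the largest normalized weight per step."""
    rng = random.Random(seed)
    true_state = 0.0
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    log_weights = [0.0] * num_particles  # equal weights to start
    max_weights = []
    for _ in range(num_steps):
        # Simulate the hidden state and a noisy measurement of it.
        true_state += rng.gauss(0.0, process_sigma)
        z = true_state + rng.gauss(0.0, obs_sigma)
        for i in range(num_particles):
            # Proposal = motion prior (assumption): move the particle, add noise.
            particles[i] += rng.gauss(0.0, process_sigma)
            # Weight update: multiply by the likelihood (done in log space).
            log_weights[i] += -0.5 * ((z - particles[i]) / obs_sigma) ** 2
        # Normalize in a numerically safe way.
        top = max(log_weights)
        unnormalized = [math.exp(lw - top) for lw in log_weights]
        total = sum(unnormalized)
        max_weights.append(max(unnormalized) / total)
    return max_weights


history = sis_max_weights()
print([round(m, 3) for m in history])
```

A largest weight close to 1 means one particle carries almost all of the probability mass, which is the degenerate situation described above.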
Illustration of degeneracy:

[Figure: particle weights w at Time 1, Time 10, and Time 19]

Resampling to Combat Degeneracy
Sampling with replacement to get N new samples, each having equal weight 1/N
Samples with high weight get replicated
Samples with low weight die off
Concentrates particles in areas of higher probability

Generic Particle Filter

[Figure: generic particle filter algorithm listing]

Sample Importance Resample (SIR)

SIR is a special case of the generic particle filter where:
- the prior density is used as the proposal density
- resampling is done every iteration
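therefore the proposal density equals the prior,
$$q(x_t \mid x_{t-1}^i, z_t) = P(x_t \mid x_{t-1}^i)$$
and thus
$$\begin{aligned}
w_t^i &\propto w_{t-1}^i \, \frac{P(z_t \mid x_t^i) \, P(x_t^i \mid x_{t-1}^i)}{P(x_t^i \mid x_{t-1}^i)} \\
&= w_{t-1}^i \, P(z_t \mid x_t^i) && \text{(cancellation)} \\
&\propto P(z_t \mid x_t^i) && \text{(the old weights are all equal due to resampling)}
\end{aligned}$$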
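The weight update above makes one SIR iteration short to write down. The following is a minimal sketch under stated assumptions, not the algorithm listing from the slides: it assumes a hypothetical 1D random-walk motion model with additive Gaussian noise and a Gaussian likelihood, and it chooses the order resample, predict, weight. The resampling step uses inverse transform sampling on the cumulative weights. The names `resample` and `sir_step` and the measurement values are made up.

```python
import bisect
import itertools
import math
import random


def resample(particles, weights, rng):
    """Draw len(particles) samples with replacement, in proportion to weights.

    Inverse transform sampling over the particle indices: draw u uniformly,
    then find the first index whose cumulative weight exceeds u.
    """
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    last = len(particles) - 1
    chosen = []
    for _ in particles:
        u = rng.random() * total
        index = min(bisect.bisect_right(cumulative, u), last)
        chosen.append(particles[index])
    return chosen


def sir_step(particles, weights, z, rng, process_sigma=1.0, obs_sigma=0.5):
    """One SIR iteration: resample, sample from the prior, weight by likelihood."""
    resampled = resample(particles, weights, rng)
    # Drawing from the prior: move each particle, then add process noise
    # (the deterministic motion is the identity in this assumed model).
    predicted = [x + rng.gauss(0.0, process_sigma) for x in resampled]
    # After resampling the old weights are equal, so new weight is the likelihood.
    log_lik = [-0.5 * ((z - x) / obs_sigma) ** 2 for x in predicted]
    top = max(log_lik)
    unnormalized = [math.exp(v - top) for v in log_lik]
    total = sum(unnormalized)
    return predicted, [u / total for u in unnormalized]


rng = random.Random(0)
num_particles = 200
particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
weights = [1.0 / num_particles] * num_particles
for z in [0.3, 0.8, 1.1]:  # hypothetical measurements
    particles, weights = sir_step(particles, weights, z, rng)
estimate = sum(w * x for x, w in zip(particles, weights))
print(estimate)
```

Working with log-likelihoods before normalizing is a design choice made here to avoid all weights underflowing to zero; it does not change the normalized weights.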
SIR Algorithm

[Figure: SIR algorithm listing]

Drawing from the Prior Density
Note: when we use the prior as the importance density, we only need to sample from the process noise distribution (typically uniform or Gaussian).

Why? Recall: $x_k = f_k(x_{k-1}, v_{k-1})$, where v is process noise.

Thus we can sample from the prior $P(x_k \mid x_{k-1})$ by starting with sample $x_{k-1}^i$, generating a noise vector $v_{k-1}^i$ from the noise process, and forming the noisy sample
$$x_k^i = f_k(x_{k-1}^i, v_{k-1}^i)$$

If the noise is additive, this leads to a very simple interpretation: move each particle using motion prediction, then add noise.

SIR Filtering Illustration
[Figure: SIR filtering illustration along the x axis over successive time steps, cycling through particle sets labeled $\{x^{(m)}, 1/M\}_{m=1}^{M}$, $\{\tilde{x}^{(m)}, 1/M\}_{m=1}^{M}$, and $\{x^{(m)}, w^{(m)}\}_{m=1}^{M}$]

Problems with SIS/SIR
Degeneracy: in SIS, after several iterations all samples except one tend to have negligible weight. Thus a lot of computational effort is spent on particles that make no contribution. Resampling is supposed to fix this, but also causes a problem...

Sample Impoverishment: in SIR, after several iterations all samples tend to collapse into a single state. The ability to represent multimodal distributions is thus short-lived.

Robert Collins CSE598G

Particle Filter Failure Analysis
References
King and Forsyth, “How Does CONDENSATION Behave with a Finite Number of Samples?” ECCV 2000, 695-709.
Karlin and Taylor, A First Course in Stochastic Processes, 2nd edition, Academic Press, 1975.

Summary

Condensation/SIR is asymptotically correct as the number of samples tends towards infinity. However, as a practical issue, it has to be run with a finite number of samples.

Iterations of Condensation form a Markov chain whose state space is quantized representations of a density.

This Markov chain has some undesirable properties:
• high variance - different runs can lead to very different answers
• low apparent variance within each individual run (appears stable)
• state can collapse to single peak in time roughly linear in number of samples
• tracker may appear to follow peaks in the posterior even in the absence of any meaningful measurements.

These properties are generally known as “sample impoverishment”.

Stationary Analysis
For simplicity, we focus on tracking problems with stationary distributions (posterior should be the same at any time step).
[because it is hard to focus on what is going on when the posterior modes are deterministically moving around. Any movement of modes in our analysis will be due to the behavior of the particle filter]

A Simple PMF State Space
Consider 10 particles representing a probability mass function over 2 locations.

PMF state space: {(0,10)(1,9)(2,8)(3,7)(4,6)(5,5)(6,4)(7,3)(8,2)(9,1)(10,0)}

[Figure: example state (4,6), shown as particle counts at locations 1 and 2]
We will now instantiate a particular two-state filtering model that we can analyze in closed form, and explore the Markov chain process (on the PMF state space above) that describes how particle filtering performs on that process.

Discrete, Stationary, No Noise
Assume a stationary process model with no noise.

process model: $X_{k+1} = F X_k + v_k$, with $F = I$ (identity) and $v_k = 0$ (no noise)

process model: $X_{k+1} = X_k$

Perfect Two-State Ambiguity
Let our two filtering states be {a,b}. We define both prior distribution and observation model to be ambiguous (equal belief in a and b).

$$P(X_0) = \begin{cases} .5 & X_0 = a \\ .5 & X_0 = b \end{cases} \qquad P(Z \mid X_k) = \begin{cases} .5 & X_k = a \\ .5 & X_k = b \end{cases}$$

from process model, with rows and columns ordered (a, b):
$$P(X_{k+1} \mid X_k) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

Recall: Recursive Filtering

Prediction:
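$$\underbrace{P(X_k \mid z_{1:k-1})}_{\text{predicted current state}} = \int \underbrace{P(X_k \mid X_{k-1})}_{\text{state transition}} \; \underbrace{P(X_{k-1} \mid z_{1:k-1})}_{\text{previous estimated state}} \, dX_{k-1}$$

Update:
$$\underbrace{P(X_k \mid z_{1:k})}_{\text{estimated current state}} = \frac{\overbrace{P(z_k \mid X_k)}^{\text{measurement}} \; \overbrace{P(X_k \mid z_{1:k-1})}^{\text{predicted current state}}}{\underbrace{P(z_k \mid z_{1:k-1})}_{\text{normalization term}}}$$

These are exact propagation equations. (For discrete states the integral becomes a sum.)

Analytic Filter Analysis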
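Predict
$$\begin{aligned}
X_k = a: &\quad 1 \cdot .5 + 0 \cdot .5 = .5 \\
X_k = b: &\quad 0 \cdot .5 + 1 \cdot .5 = .5
\end{aligned}$$

Update
$$\begin{aligned}
X_k = a: &\quad .5 \cdot .5 = .25, \quad .25 / (.25 + .25) = .5 \\
X_k = b: &\quad .5 \cdot .5 = .25, \quad .25 / (.25 + .25) = .5
\end{aligned}$$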
Therefore, for all k, the posterior distribution is
$$P(X_k \mid z_{1:k}) = \begin{cases} .5 & X_k = a \\ .5 & X_k = b \end{cases}$$
which agrees with our intuition regarding the stationarity and ambiguity of our two-state model.
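The predict and update arithmetic above can be checked by running the exact recursion for several steps. This is a small sketch; the function name `bayes_filter_step` and the list-based representation (index 0 for state a, index 1 for state b) are choices made for this example.

```python
def bayes_filter_step(belief, transition, likelihood):
    """One exact predict/update step for a discrete-state Bayes filter.

    transition[i][j] is P(X_{k+1} = j | X_k = i); likelihood[j] is P(z | X = j).
    """
    n = len(belief)
    # Prediction: sum over previous states.
    predicted = [sum(transition[i][j] * belief[i] for i in range(n)) for j in range(n)]
    # Update: multiply by the likelihood, then normalize.
    unnormalized = [likelihood[j] * predicted[j] for j in range(n)]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]


belief = [0.5, 0.5]                      # ambiguous prior over {a, b}
transition = [[1.0, 0.0], [0.0, 1.0]]    # stationary, no-noise process model
likelihood = [0.5, 0.5]                  # ambiguous observation model

for _ in range(10):
    belief = bayes_filter_step(belief, transition, likelihood)
print(belief)  # stays at [0.5, 0.5]
```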
Now let’s see how a particle filter behaves...

Particle Filter
Consider 10 particles representing a probability mass function over our 2 locations {a,b}.
In accordance with our ambiguous prior, we will initialize with 5 particles in each location
[Figure: P(X0) shown as a histogram with 5 particles at each of locations a and b]

Condensation (SIR) Particle Filter
1) Select N new samples with replacement, according to the sample weights (equal weights in this case)
2) Apply process model to each sample (deterministic motion + noise) (no-op in this case)
3) For each new position, set weight of particle in accordance with observation probability (all weights become .5 in this case)
4) Normalize weights so they sum to one (weights are still equal)

Condensation as Markov Chain (Key Step)

Recall that 10 particles representing a probability mass function over 2 locations can be thought of as having a state space with 11 elements:
{(0,10)(1,9)(2,8)(3,7)(4,6)(5,5)(6,4)(7,3)(8,2)(9,1)(10,0)}
[Figure: current configuration (5,5), with 5 particles at each of a and b]

We want to characterize the probability that the particle filter procedure will transition from the current configuration to a new configuration:
{(0,10)(1,9)(2,8)(3,7)(4,6)(5,5)(6,4)(7,3)(8,2)(9,1)(10,0)}

Let P(j | i) be the probability of transitioning from (i,10-i) to (j,10-j).

Example
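N=10 samples

[Figure: transitions out of (5,5) with their probabilities: to (5,5) .2461, to (4,6) .2051, to (6,4) .2051, to (3,7) .1172; alongside a plot of P(j | 5) for j from 0 to 10, vertical axis from 0 to .25]

Full Transition Table

[Figure: transition table P(j | i) for i and j from 0 to 10, with the row P(j | 5) plotted on a vertical axis from 0 to .25]

The Crux of the Problem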
P(j|5): from (5,5), there is a good chance we will jump away from (5,5), say to (6,4).

P(j|6): once we do that, we are no longer sampling from the transition distribution at (5,5), but from the one at (6,4). But this is biased off center from (5,5).

P(j|7): and so on. The behavior will be similar to that of a random walk.

Another Problem

P(0|0) = 1 and P(10|10) = 1: (0,10) and (10,0) are absorbing states!

[Figure: transition table P(j | i) for i and j from 0 to 10, with the corner entries P(0|0) = 1 and P(10|10) = 1 marked]

Observations
• The Markov chain has two absorbing states (0,10) and (10,0)
• Once the chain gets into either of these two states, it can never get out (all the particles have collapsed into a single bucket)
• There is a nonzero probability of getting into either absorbing state, starting from (5,5)
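The transition probabilities and the absorbing states can be reproduced with a few lines of code. The sketch below assumes that, because the weights are equal, each of the N resampled particles independently lands in location a with probability i/N, so P(j | i) is a binomial probability; this assumption reproduces the values .2461, .2051, and .1172 quoted for P(j | 5). The name `transition_row` is made up for this example.

```python
import math


def transition_row(i, num_particles):
    """Row P(j | i) for j = 0..N, assuming binomial resampling with p = i/N."""
    p_a = i / num_particles
    return [math.comb(num_particles, j) * p_a ** j * (1.0 - p_a) ** (num_particles - j)
            for j in range(num_particles + 1)]


N = 10
table = [transition_row(i, N) for i in range(N + 1)]

# Row for the starting configuration (5,5).
print([round(p, 4) for p in table[5]])

# Absorbing states: once all particles are in one bucket, they stay there.
print(table[0][0], table[N][N])

# Probability of being absorbed in a single step starting from (5,5).
print(table[5][0] + table[5][N])
```

Raising this table to higher matrix powers would give the multi-step behavior of the chain; that extension is left out to keep the sketch short.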
These are the seeds of our destruction!

Simulation

Some sample runs with 10 particles

More Sample Runs

[Figure: sample runs for N=10, N=20, and N=100]

Average Time to Absorption
[Figure: average time to absorption plotted against number of particles N]

Dots: from running simulator (100 trials at N=10,20,30...)
Line: plot of 1.4 N, the asymptotic analytic estimate (King and Forsyth)

More Generally
Implications of stationary process model with no noise, in a discrete state space.
• any time any bucket contains zero particles, it will forever after have zero particles (for that run).
• there is typically a nonzero probability of getting zero particles in a bucket sometime during the run.
• thus, over time, the particles will inevitably collapse into a single bucket.

Extending to Continuous Case
A similar thing happens in more realistic cases. Consider a continuous case with two stationary modes in the likelihood, and where each mode has small variance with respect to distance between modes.
[Figure: likelihood with two well-separated peaks, mode1 and mode2]

Because each mode's variance is very small relative to the distance between the modes, the likelihood in the region between them is very low, which is fatal to any particles that try to cross from one mode to the other via diffusion.
Each mode thus becomes an isolated island, and we can reduce this case to our previous two-state analysis (each mode is one discrete state)
[Figure: mode1 and mode2 mapped to the discrete states a and b]
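Since each isolated mode acts as one discrete state, the two-bucket resampling process from the earlier slides can stand in for this case. The following is a rough sketch of such a simulator, assuming equal weights so that each resampled particle lands in bucket a with probability equal to the current fraction of particles in a; it counts the resampling steps until every particle sits in one bucket. The function names, trial count, and seed are made up, and the slides' simulator may differ in detail.

```python
import random


def steps_to_absorption(num_particles, rng):
    """Count resampling steps until all particles are in a single bucket."""
    count_a = num_particles // 2  # start from the balanced configuration
    steps = 0
    while 0 < count_a < num_particles:
        p_a = count_a / num_particles
        # Resample N particles with replacement; equal weights are assumed.
        count_a = sum(1 for _ in range(num_particles) if rng.random() < p_a)
        steps += 1
    return steps


def average_absorption_time(num_particles, trials=100, seed=0):
    """Average steps_to_absorption over independent trials."""
    rng = random.Random(seed)
    total = sum(steps_to_absorption(num_particles, rng) for _ in range(trials))
    return total / trials


for n in (10, 20, 30):
    # Compare against the 1.4 N estimate quoted from King and Forsyth.
    print(n, average_absorption_time(n), 1.4 * n)
```

The averages vary from run to run with the seed and the number of trials, so they should only be expected to follow the 1.4 N line roughly.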