MONTE CARLO METHODS FOR STRUCTURED DATA
A DISSERTATION SUBMITTED TO THE INSTITUTE FOR COMPUTATIONAL AND MATHEMATICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
Adam Guetz January 2012
© 2012 by Adam Nathan Guetz. All Rights Reserved. Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/rg833nw3954
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Susan Holmes, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Amin Saberi, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Peter Glynn
Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Recent years have seen an increased need for modeling of rich data across many engineering and scientific disciplines. Much of this data contains structure, or non-trivial relationships between elements, that should be exploited when performing statistical inference. Sampling from and fitting complicated models present challenging computational issues, and available deterministic heuristics may be ineffective. Monte Carlo methods present an attractive framework for finding approximate solutions to these problems. This thesis covers two closely related techniques: adaptive importance sampling and sequential Monte Carlo. Both of these methods make use of sampling-importance resampling to generate approximate samples from distributions of interest. Sequential importance sampling is well known to have difficulties in high-dimensional settings. I present a technique called conditional sampling-importance resampling, an extension of sampling-importance resampling to conditional distributions that improves performance, particularly when independence structure is present. The primary application is to multi-object tracking for a colony of harvester ants in a laboratory setting. Previous approaches tend to make simplifying parametric assumptions on the model in order to make computations more tractable, while the approach presented finds approximate solutions to more complicated and realistic models. To analyze structural properties of networks, I expand adaptive importance sampling techniques to the analysis of network growth models such as preferential attachment, using the Plackett-Luce family of distributions on permutations, and I present an application of sequential Monte Carlo to a special form of network growth model called vertex censored stochastic Kronecker product graphs.
Acknowledgements
I’d like to thank my wife Heidi Lubin, my son Levi, my principal advisor Susan Holmes, my co-advisor Amin Saberi, my parents, and all of my friends and extended family.
Contents
Abstract iv
Acknowledgements v
1 Introduction 1
1.1 Monte Carlo Integration 2
1.2 Applications 3
2 Approximate Sampling 6
2.1 Importance Sampling 7
2.1.1 Effective Sample Size 9
2.1.2 Sampling Importance Resampling 11
2.2 Markov Chain Monte Carlo 13
2.2.1 Markov Chains 13
2.2.2 Metropolis Hastings 14
2.2.3 Gibbs Sampler 14
2.2.4 Data Augmentation 16
2.2.5 Hit-and-Run 16
3 Sequential Monte Carlo 18
3.1 Sequential Models 18
3.2 Sequential Importance Sampling 20
3.3 Particle Filter 22
4 Adaptive Importance Sampling 24
4.1 Background 24
4.1.1 Variance Minimization 25
4.1.2 Cross-Entropy Method 26
4.2 Avoiding Degeneracy 28
4.3 Related Methods 30
4.3.1 Annealed Importance Sampling 30
4.3.2 Population Monte Carlo 31
5 Conditional Sampling Importance Resampling 33
5.1 Motivation 33
5.2 Conditional Resampling 34
5.2.1 Estimating Marginal Importance Weights 36
5.2.2 Conditional Effective Sample Size 36
5.2.3 Importance Weight Accounting 37
5.3 Example: Multivariate Normal 38
6 Multi-Object Particle Tracking 43
6.1 Background 43
6.1.1 Single Object Tracking 43
6.1.2 Multi Object Tracking 45
6.1.3 Tracking Notation 46
6.2 Conditional SIR Particle Tracking 47
6.2.1 Grouping Subsets for Multi-Object Tracking 48
6.3 Application: Tracking Harvester Ants 49
6.3.1 Object Detection 49
6.3.2 Observation Model 51
6.3.3 State-Space Model 53
6.3.4 Importance Distribution 54
6.3.5 Computing Relative and Marginal Importance Weights 62
6.4 Empirical Results 64
6.4.1 Simulated Data 64
6.4.2 Short Harvester Ant Video 65
7 Network Growth Models 70
7.1 Background 71
7.1.1 Erdős–Rényi 73
7.1.2 Preferential Attachment 73
7.1.3 Duplication/Divergence 75
7.2 Computing Likelihoods with Adaptive Importance Sampling 75
7.2.1 Marginalizing Vertex Ordering 78
7.2.2 Plackett-Luce Model as an Importance Distribution 79
7.2.3 Choice of Description Length Function 80
7.3 Examples 81
7.3.1 Modified Preferential Attachment Model 81
7.3.2 Adaptive Importance Sampling 82
7.3.3 Annealed Importance Sampling 82
7.3.4 Computational Effort 83
7.3.5 Numerical Results 84
8 Kronecker Product Graphs 91
8.1 Motivation 92
8.2 Stochastic Kronecker Product Graph Model 94
8.2.1 Likelihood under Stochastic Kronecker Product Graph Model 94
8.2.2 Sampling Permutations 96
8.2.3 Computing Gradients 96
8.3 Vertex Censored Stochastic Kronecker Product Graphs 97
8.3.1 Importance Sampling for Likelihoods 98
8.3.2 Choosing Censored Vertices 100
8.3.3 Sampling Permutations 100
8.3.4 Multiplicative Attribute Graphs 101
8.4 Empirical Results 101
8.4.1 Implementation 102
List of Tables
6.1 Observation event types. 53
7.1 Comparison of estimators for sparse 500 node preferential attachment dataset from Figure 7.1 84
7.2 Comparison of estimators for dataset: 5 networks, 30 nodes each, average degree 2, 20 samples each method 86
7.3 Comparison of estimators for dataset: 2 networks, 100 nodes each, average degree 2 86
7.4 Estimated log-likelihoods for Mus musculus protein-protein interaction networks 87
List of Figures
3.1 Dependence structure of hidden Markov models 19
5.1 CSIR Normal example: eigenvalues of covariance matrices 40
5.2 CSIR Normal example: estimated KL-divergences 41
5.3 Same experiments as in Figure 5.2, plotted by method. 42
6.1 Example grouping subset functions 49
6.2 Blob bisection via spectral partitioning 52
6.3 Association of objects with observations. 'Events' correspond to connected components in this bipartite graph, including normal observations, splitting, merging, false positives, false negatives, and joint events. 57
6.4 "True" distribution of path lengths and trajectories per frame, simulated example. 66
6.5 Centroid observations per frame, simulated example. 66
6.6 Distribution of path lengths and trajectories per frame using a sample from the importance distribution, simulated example. 67
6.7 Distribution of path lengths and trajectories per frame using CSIR, simulated example. 67
6.8 GemVident screenshot, showing centroids. 68
6.9 Centroid observations per frame from harvester ant example. 68
6.10 Distribution of path lengths and trajectories per frame using a sample from the importance distribution, harvester ant example. 69
6.11 Distribution of path lengths and trajectories per frame using CSIR, harvester ant example. 69
7.1 Example runs comparing annealed importance sampling and adaptive importance sampling 85
7.2 Likelihoods and importance weights for cross-entropy method. 88
7.3 Mus musculus (common mouse) PPI network. 89
7.4 Convergence of adaptive importance sampling and annealed importance sampling for Mus musculus PPI network. 90
8.1 Comparison of crude and SIS Monte Carlo for Kronecker graph likelihoods. 103
8.2 Comparison of SKPG and VCSKPG models for AS-ROUTEVIEWS graph 104
Chapter 1
Introduction
Contemporary data analysis often involves information with complicated and high-dimensional relationships between elements. Traditional, deterministic analytic techniques are often unable to directly cope with the computational challenge, and must make simplifying assumptions or heuristic approximations. An attractive alternative is the suite of randomized methods known as Monte Carlo. The types of problems examined in this thesis often contain both discrete and continuous components, and can generally be expressed as or related to integral or summation type problems. Suppose one wishes to compute some quantity µ defined as

$\mu = \int_{\Omega} X(\omega)\, P(d\omega)$.  (1.1)
If $X : \Omega \to \mathbb{R}$ is a random variable defined on the probability space $(\Omega, \Sigma, P)$, then this can be equivalently expressed as the expected value
$\mu = E[X]$.  (1.2)
In some cases, µ can be computed exactly using analytic techniques. For many examples this is not possible and one must resort to methods of approximation. Deterministic numerical integration, or quadrature, generally has good convergence properties for low- and moderate-dimensional integrals. However, the computational complexity of quadrature increases exponentially in the dimension of the sample space
Ω, making high-dimensional inference computationally intractable. This general phenomenon is known as the curse of dimensionality [11], and can be explained in terms of the relative "sparseness" of high-dimensional space. Monte Carlo integration can be a viable alternative to quadrature in these settings, as its error decreases in proportion to the inverse square root of the sample size, regardless of dimension.
1.1 Monte Carlo Integration
Given $N$ independent, identically distributed random variables $X_1, \dots, X_N$ with $E[X^2] < \infty$, and writing $\hat{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i$, the strong law of large numbers gives
$\hat{X}_N \xrightarrow{a.s.} \mu$,  (1.3)
where $\xrightarrow{a.s.}$ denotes almost sure convergence. This provides the motivation to use $\hat{x}_N$ as an approximation to µ. Using $\hat{x}_N$ to approximate µ is known as Monte Carlo integration. Notationally, in this thesis bold uppercase letters such as X indicate random variables, while lowercase letters such as x indicate observations of those random variables. Under the above conditions, the central limit theorem states that
$\hat{X}_N - \mu \xrightarrow{d} \mathcal{N}(0, \sigma_X^2 / N)$,  (1.4)

where $\xrightarrow{d}$ denotes convergence in distribution, and $\sigma_X^2 = \mathrm{var}[X]$. Roughly speaking, this means that in the limit, convergence of $\hat{X}_N$ to µ occurs at a $\sqrt{N}$ rate. In other words, to get another digit of accuracy (a factor of 10), one would need $10^2 = 100$ times as many samples. While this rate of convergence is unappealing for low-dimensional integrals, the rate holds regardless of the sample space Ω, thereby allowing one to circumvent the curse of dimensionality for high-dimensional problems. This formulation, however, hides some of the additional complexity inherent in Monte Carlo integration. One difficulty is that the variance $\sigma_X^2$ may grow exponentially in the dimension of the sample space Ω. In these cases, the Monte Carlo convergence guarantee is unhelpful, since the starting point is an estimator with exponentially high variance. To make Monte Carlo methods practical in this case, it is necessary to employ variance reduction techniques. For comprehensive reviews of variance reduction techniques, see Liu [68] or Asmussen and Glynn [6]. The primary method for variance reduction used in this text is importance sampling, where instead of directly sampling the random variable X to estimate (1.1), one instead samples from some biased random variable Y and corrects for the bias. See §2.1 for background on importance sampling. Another potential source of complexity is in the generation of independent random draws from the sample space ω ∈ Ω and the computation of the random variable X(ω). For many commonly occurring problems, the best known algorithms for sampling exactly from the sample space of interest take exponential (or worse) time. In these cases the only feasible alternative is to use approximate sampling techniques. In this thesis, I will examine two main techniques for approximate sampling: sampling importance resampling (SIR) and Markov chain Monte Carlo (MCMC). Background for these techniques is covered in §2. Advanced topics covered include sequential Monte Carlo (§3), including particle filtering, and adaptive importance sampling (§4).
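To make the $\sqrt{N}$ behavior concrete, here is a minimal crude Monte Carlo integration sketch; the integrand and sample sizes are illustrative choices, not examples from the text:

```python
import math
import random

def mc_estimate(f, sample, n):
    """Crude Monte Carlo: average f over n iid draws from `sample`."""
    return sum(f(sample()) for _ in range(n)) / n

random.seed(0)
# Estimate mu = E[cos(U)] for U ~ Uniform(0, 1); the exact value is sin(1).
exact = math.sin(1.0)
for n in (100, 10_000):
    est = mc_estimate(math.cos, random.random, n)
    # The absolute error shrinks like 1/sqrt(n), independent of dimension.
    print(n, abs(est - exact))
```

Quadrupling the sample size roughly halves the error, which is the $\sqrt{N}$ rate of (1.4) in action.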
1.2 Applications
In §6, techniques are discussed for tracking large numbers of possibly interacting objects. In multi-object tracking (also known as multi-target tracking), the sequential process of interest can be represented as a hidden Markov model (HMM). The primary goal is to make inferences about the hidden state. There are many techniques available for tracking multiple targets, but almost all of those currently available either make broad simplifying assumptions or are infeasible for large problem sizes. Generally it is preferable to use models of movement and observation that are as realistic as possible while still permitting scalable analysis. One can generally state the problem of inference as being equivalent to sampling from a posterior distribution, but in the tracking models studied in this thesis it is generally impossible to make exact draws from the posterior distribution. Instead, sequential Monte Carlo techniques are used
to draw approximate samples. The standard particle filter typically does not work well with high-dimensional models due to degeneracy issues, where the bulk of the importance weight mass is concentrated on a small number of particles. Part of the degeneracy issue can be resolved using standard SIR; however, in the high-dimensional multi-target tracking example studied in §6 it is insufficient. To address this issue, §5 introduces a conditional sampling importance resampling step that takes advantage of inherent independence structures in the model. When individual targets are far from one another, conditional SIR effectively admits a separate particle filter for each target while asymptotically maintaining the correct joint distribution. An implementation of this algorithm is given, with empirical examples of tracking the movements of harvester ants in a laboratory setting. Adaptive importance sampling is a technique similar in many respects to the particle filter. In §7 an application of adaptive importance sampling to inference for network growth models is studied. Network growth models are models of network creation in which a new vertex arrives and attaches edges to pre-existing vertices according to a rule depending on the current state of the network. In the application considered, the goal is to estimate the likelihood that a given network originated from the growth model, for the purpose of model selection. A primary technical difficulty is that one typically does not know the order in which vertices joined the network. A priori, each ordering is equally likely, so one needs to consider all possible permutations of orderings to make valid inferences. This quantity can be expressed as a summation over all permutations, and can be represented as estimating the normalizing constant of the distribution that has probability proportional to the model likelihood for each permutation.
Since there are a factorial number of permutations, direct summation is infeasible for even moderate numbers of vertices, and one must resort to approximation techniques. In this setting, crude Monte Carlo tends to work poorly, since most of the likelihood is concentrated on a vanishingly small subset of permutations. To reduce the variance of the estimator, adaptive importance sampling is used, with importance distributions selected from the Plackett-Luce family of permutation distributions. This is a novel use of this family of distributions in the importance sampling context. An example is given using the technique on a modified version of
preferential attachment. In §8, inference in a special type of network growth model known as the stochastic Kronecker product graph (SKPG) is discussed. These models have a simple formulation that permits relatively efficient estimation of maximum likelihood parameters. The SKPG model is a generalization of the Erdős–Rényi G(n, p) model, and implicitly constructs a matrix of Bernoulli edge probabilities using Kronecker products of smaller seed matrices. These models suffer from the same difficulty as other network growth models in that in order to compute the likelihood one needs to sum over all possible vertex labeling permutations, but SKPGs have the advantage that it is relatively easy to compute the normalizing constant. To address the permutation issue, Leskovec and Faloutsos [64] use a Markov chain Monte Carlo algorithm over permutation space. However, these models encounter difficulties when there is a mismatch between the model dimensions and the number of vertices in the network data. To address these issues, the vertex censored stochastic Kronecker product graph (VCSKPG) model is introduced, which allows more flexibility in the allowable number of model vertices. A sequential importance sampling scheme is proposed to perform efficient parameter fitting and likelihood estimation for this model.

Chapter 2
Approximate Sampling
In many application settings, it is desirable to sample from a distribution of interest π, but it is impossible to do so in a reasonable amount of time, i.e. sampling from π is computationally intractable. Approximate sampling refers to a set of methods that attempt the next best thing, which is to sample from some distribution γ that is in some sense "close" to π. There is a strong connection between approximate sampling and estimation problems. In particular, Jerrum et al. [53] were able to give a polynomial-time reduction between almost uniform sampling and approximate counting. Another way to view this relationship is through the well-known importance sampling identity (2.6), which gives a zero-variance estimator when sampling from the optimal importance distribution γ*, and low-variance estimates when approximately sampling from γ*. To measure "closeness" of γ to π, one can use either metrics between probability distributions, such as the total variation distance,
$d_{TV}(\pi, \gamma) = \sup_{A \subset \Omega} |\pi(A) - \gamma(A)|$,  (2.1)
or pseudo-metrics such as the Kullback-Leibler divergence,
$d_{KL}(\pi \| \gamma) = E_{\pi}\!\left[\log \frac{\pi(X)}{\gamma(X)}\right]$  (2.2)
$= E_{\pi}[\log \pi(X)] - E_{\pi}[\log \gamma(X)]$,  (2.3)
where the notation $E_{\pi}$ indicates that the random variable X is distributed according to π. Although Kullback-Leibler divergence is not a true distance function, as it is not symmetric ($d_{KL}(\pi \| \gamma) \neq d_{KL}(\gamma \| \pi)$ in general), it does have the property that $d_{KL}(\pi \| \gamma) = 0$ if and only if $\pi(x) = \gamma(x)$ for all x with nonzero measure. The two terms on the RHS of (2.3) are the negative entropy of π, $E_{\pi}[\log \pi(X)]$, roughly representing sample diversity, and the negative cross-entropy, $E_{\pi}[\log \gamma(X)]$, representing the "goodness of fit" of γ to π.
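For discrete distributions, the divergence and its asymmetry are easy to verify numerically. A small sketch (the two distributions are arbitrary illustrations, not taken from the text):

```python
import math

def kl_divergence(p, q):
    """d_KL(p || q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # differs from the above: d_KL is not symmetric
print(kl_divergence(p, p))  # exactly 0 when the distributions agree
```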
2.1 Importance Sampling
Suppose that there exists a random variable $X : \Omega \to \mathbb{R}$, and one wishes to compute the expected value of X,

$\mu \equiv E[X] = \int_{\Omega} X(\omega)\, \pi(d\omega)$,  (2.4)

where π is the target distribution. In many cases, computing µ exactly is intractable since it is necessary to integrate over the entire state space Ω. For example, the problem of approximating the permanent of a matrix [52] can be represented as
" n # 1 Y perm(A) = E a , (2.5) n! i,X(i) i=1 where X is a random permutation. Computing the permanent is a computationally difficulty and is known to be #P-complete [98]. #P-complete comprises a set of counting problems with no known polynomial-time algorithms, and can be thought of as the counting analog of NP-complete problems. One way to estimate such prob- lems is through crude Monte Carlo simulations, drawing N independent, identically CHAPTER 2. APPROXIMATE SAMPLING 8
distributed (iid) samples x1, . . . , xN , then computingx ˆN . Xb N may, however, have unacceptably large variance, exponentially high in the case of approximating the per- manent. One practical variance reduction technique is known as importance sampling (IS). Importance sampling builds an estimator by sampling from a biased distribution in which the “important” or more heavily weighted states are visited more frequently. Helpful background references for importance sampling include Evans and Swartz [35], Asmussen and Glynn [6], Liu [68], and Robert and Casella [85]. Suppose there exist π(dω) random variables Y, Z, such that Y(ω) = X(ω) and Z(ω) = X(ω) γ(dω) . Importance sampling is based on the following simple identity:
$E[X] = \int_{\Omega} X(\omega)\, \frac{\pi(d\omega)}{\gamma(d\omega)}\, \gamma(d\omega) = E[Z]$.  (2.6)
The importance sampling identity (2.6) holds as long as Z is well defined, i.e. $\gamma(d\omega) = 0 \implies \pi(d\omega) = 0$ for ω ∈ Ω. γ is the importance distribution, and the ratio $W(\omega) \equiv \pi(d\omega)/\gamma(d\omega)$ is the importance weight of ω. One can draw N iid samples of Z and use $\hat{\mu}_{IS} \equiv \hat{Z}_N$ as the unbiased importance estimator of µ. If $\mathrm{var}[Z] < \mathrm{var}[X]$, $\hat{Z}_N$ will be a better estimate of µ than $\hat{X}_N$. A primary challenge in importance sampling is choosing an importance distribution γ that minimizes var[Z]. Practically, it is often the case that one only knows π and/or γ up to a constant factor, $\pi(d\omega) = f(d\omega)/C_X$, $\gamma(d\omega) = g(d\omega)/C_Y$, where $C_X$ and $C_Y$ are the normalizing constants of f and g. The ratio of the normalizing constants is denoted $C \equiv C_X / C_Y$, with the random variables $\widetilde{W}(\omega) \equiv f(d\omega)/g(d\omega)$ and $\widetilde{Z}(\omega) \equiv Y(\omega)\widetilde{W}(\omega) = C\,Z(\omega)$. Since $E[W] = 1$, one can build an unbiased estimator of the ratio C through draws of the unnormalized importance ratios, $\widehat{\widetilde{W}}_N$. Using the same samples $\omega_1, \dots, \omega_N$ to compute $\widehat{\widetilde{W}}_N$ as for $\widehat{\widetilde{Z}}_N$ leads to the biased importance estimator
$\hat{\mu}_{BIS} \equiv \frac{\widehat{\widetilde{Z}}_N(\omega_1, \dots, \omega_N)}{\widehat{\widetilde{W}}_N(\omega_1, \dots, \omega_N)}$.  (2.7)
Although biased for finite sample sizes, $\hat{\mu}_{BIS}$ is asymptotically unbiased and does not require normalizing constants. The optimal sampling distribution γ* is one for which $E[Z^2]$ is smallest. As per Rubinstein and Kroese [89], $E[Z^2]$ is minimized with $\gamma^*(d\omega) \propto |X(\omega)|\,\pi(d\omega)$. This gives

$E[Z^{*2}] = \int_{\Omega} Z^2(\omega)\, \gamma^*(d\omega)$  (2.8)
$= \left(\int_{\Omega} |X(\omega)|\, \pi(d\omega)\right)^2 = E[|X|]^2$.  (2.9)

In particular, if X ≥ 0, γ* provides a zero-variance estimator. Direct computation of the optimal sampling distribution is typically impossible, since it relies on $\int |X(\omega)|\,\pi(d\omega)$ as a normalizing constant, which is the quantity to be estimated in the first place. However, it is often helpful to use γ* as a guide to construct "good" importance sampling distributions.
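The identity (2.6) can be exercised on a small rare-event example. The sketch below estimates P(X > 3) for X ~ N(0, 1) by sampling from the shifted importance distribution γ = N(3, 1), whose weight π(x)/γ(x) = exp(9/2 − 3x) is available in closed form; the shift and sample sizes are illustrative choices, not from the text:

```python
import math
import random

random.seed(1)

def crude(n):
    # Crude Monte Carlo: almost every draw misses the event {x > 3}.
    return sum(random.gauss(0.0, 1.0) > 3.0 for _ in range(n)) / n

def importance(n):
    # Draw from gamma = N(3, 1) and reweight by w(x) = pi(x)/gamma(x).
    total = 0.0
    for _ in range(n):
        x = random.gauss(3.0, 1.0)
        if x > 3.0:
            total += math.exp(4.5 - 3.0 * x)  # phi(x)/phi(x - 3)
    return total / n

exact = 0.0013499  # 1 - Phi(3), for reference
print(crude(10_000), importance(10_000))
```

With the same budget, the importance estimate is far closer to 1 − Φ(3) ≈ 0.00135 than the crude one, since roughly half the γ-draws land in the event of interest.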
2.1.1 Effective Sample Size
An important concept when using importance sampling is degeneracy. Degeneracy occurs when the bulk of the importance weight mass is concentrated on one or a small number of importance samples. This means that the Monte Carlo approximation will be dominated by a small subset of samples, so the "effective" number of samples used to compute the approximation is small. A standard measure of degeneracy for importance sampling methods is therefore the effective sample size [61],
$ESS = \frac{N}{1 + \mathrm{var}[W]} = \frac{N}{E[W^2]}$,  (2.10)

where cv refers to the coefficient of variation of the importance weights ($cv = \sqrt{\mathrm{var}[W]} / E[W]$), $E[W] = 1$, and $\mathrm{var}[W] = E[W^2] - 1$. One justification for the use of the effective sample size comes from Liu [67], based on a note from Kong [61], using the delta method, and states that

$\frac{\mathrm{var}[\hat{x}_N]}{\mathrm{var}[\hat{\mu}_{BIS,N}]} \approx \frac{1}{1 + \mathrm{var}[W]}$.  (2.11)
This goes as follows. First note that using the standard delta method for ratio statistics [21] gives
$\mathrm{var}[\hat{\mu}_{BIS,N}] \approx \frac{1}{N}\left(\mathrm{var}[Z] + \mu^2\,\mathrm{var}[W] - 2\mu\,\mathrm{cov}(Z, W)\right)$.  (2.12)
Further note that
$\mathrm{cov}(Z, W) = E\!\left[\frac{\pi(X)}{\gamma(X)}\, X\right] - \mu$  (2.13)
$= \mathrm{cov}\!\left(\frac{\pi(X)}{\gamma(X)}, X\right) + \mu\, E\!\left[\frac{\pi(X)}{\gamma(X)}\right] - \mu$,  (2.14)

and that
$\mathrm{var}[Z] = E\!\left[\frac{\pi^2(X)}{\gamma^2(X)}\, X\right] - \mu^2$  (2.15)
$\approx E[X]\, E\!\left[\frac{\pi^2(X)}{\gamma^2(X)}\right] + \mathrm{var}(X)\, E\!\left[\frac{\pi(X)}{\gamma(X)}\right] + 2\mu\, \mathrm{cov}\!\left(\frac{\pi(X)}{\gamma(X)}, X\right) - \mu^2$.  (2.16)
Applying this to (2.12) gives
$\mathrm{var}[\hat{\mu}_{BIS,N}] \approx \frac{1}{N}\,\mathrm{var}(X)\,(1 + \mathrm{var}(W))$,  (2.17)

which yields (2.11). Note that the remainder term in (2.16) is
$E\!\left[\left(\frac{\pi(X)}{\gamma(X)} - E\!\left[\frac{\pi(X)}{\gamma(X)}\right]\right)(X - \mu)^2\right]$,  (2.18)

which can be large depending on the distribution of X. Typically in practice one knows neither the true variance of the importance weights nor the normalizing constants, so it is common to use the empirical unnormalized
effective sample size,
$\widehat{ESS} = \frac{\left(\sum_{i=1}^{N} \widetilde{W}(\omega_i)\right)^2}{\sum_{i=1}^{N} \widetilde{W}(\omega_i)^2}$  (2.19)

as a heuristic measure of degeneracy.
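A direct transcription of (2.19), assuming only a list of unnormalized weights:

```python
def effective_sample_size(weights):
    """Empirical ESS of (2.19): (sum w)^2 / sum(w^2), on unnormalized weights."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

# Equal weights: no degeneracy, ESS equals the number of samples.
print(effective_sample_size([2.5] * 100))   # 100.0
# One dominant weight: ESS collapses toward 1.
print(effective_sample_size([100.0] + [1e-6] * 99))
```

Note that the estimate is invariant to rescaling the weights, which is why normalizing constants are not needed.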
2.1.2 Sampling Importance Resampling
Sampling importance resampling (SIR) is a technique for approximate sampling based on importance weights. Although not widely used in general approximate sampling settings, where Markov chain Monte Carlo methods are often more easily implemented and efficient, sampling importance resampling has found a niche in sequential settings, where Markov chain based approximate sampling methods carry a heavier computational burden. Sequential importance sampling is discussed further in §3.2. The goal of resampling in the sequential setting is generally to reduce degeneracy. The idea for SIR came originally from Rubin [87], and it was introduced in the sequential context by Gordon et al. [42]. SIR uses samples drawn from an importance distribution γ to approximately draw samples from the target distribution π. The procedure to draw M importance samples is as follows. Given N samples $y^{(1)}, \dots, y^{(N)}$ drawn according to γ, compute the importance weights $w^{(i)}$ for each sample. Choose an index i with probability proportional to $w^{(i)}$ and assign $\tilde{x} = y^{(i)}$. This process is repeated M times (with replacement) to generate a collection of approximate samples, $\{\tilde{x}^{(i)}\}_{i=1}^{M}$. The quality of the approximate samples depends on the sample size N and how closely γ matches π. Assuming mild regularity conditions on the importance distribution γ, asymptotic convergence to the target distribution can be shown (see Asmussen and Glynn [6], p. 387). Since only relative importance weights are needed for resampling, normalizing constants are not needed. One possible use of the collection of approximate samples $\{\tilde{x}^{(i)}\}_{i=1}^{N}$ is to construct
Algorithm 1 Sampling Importance Resampling (SIR)
  Draw N samples $y^{(1)}, \dots, y^{(N)} \sim P_y$.
  Compute importance weights $\{w^{(i)}\}_{i=1}^{N}$.
  Draw M samples $\tilde{x}^{(1)}, \dots, \tilde{x}^{(M)}$ with replacement from $\{y^{(i)}\}_{i=1}^{N}$ with probabilities proportional to $\{w^{(i)}\}_{i=1}^{N}$.

an estimator of E[X]
$\hat{\mu}_{SIR} = N^{-1} \sum_{i=1}^{N} \tilde{x}^{(i)}$.  (2.20)
However, this estimator is not very useful in practice, as $\hat{\mu}_{SIR}$ will always have higher variance than $\hat{\mu}_{IS}$ due to the additional variance from multinomial sampling. A better resampling method that can reduce the multinomial noise, known as residual resampling and introduced by Liu and Chen [69], is as follows. First, normalize the importance weights to sum to one, $\tilde{w}^{(i)} = w^{(i)} / \sum_{j=1}^{N} w^{(j)}$. Then take $\lfloor M\tilde{w}^{(j)} \rfloor$ copies of each sample j, for a total of $k = \sum_{j=1}^{N} \lfloor M\tilde{w}^{(j)} \rfloor$ samples; set $\tilde{w}^{(j)} \leftarrow M\tilde{w}^{(j)} - \lfloor M\tilde{w}^{(j)} \rfloor$, renormalize the importance weights, and take the remaining M − k samples with replacement from the resulting multinomial distribution. This procedure makes the same expected number of copies of each sample, but if k is large it can greatly reduce the multinomial noise. Another variation of SIR, introduced by Skare et al. [94], uses modified importance weights. Instead of choosing samples with probability $\tilde{w}^{(i)}$, one can use weights proportional to $\tilde{w}^{(i)} / (1 - \tilde{w}^{(i)})$. Using these weights with M fixed and letting N → ∞, Skare et al. [94] were able to show point-wise rates of convergence for $\tilde{X}$ to X of $O(N^{-1})$ when sampling with replacement and $O(N^{-2})$ when sampling without replacement. The idea of sampling without replacement for SIR when M ≪ N originally comes from Gelman [38], and intuitively can be thought of as producing an "intermediate representation" between the sampling and target distributions.
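The plain multinomial SIR step and the residual variant can be sketched in a few lines (pure Python, with `random.choices` performing the multinomial draw; the sample values are illustrative):

```python
import math
import random

random.seed(2)

def sir(samples, weights, m):
    """Multinomial SIR: draw m samples with probability proportional to weight."""
    return random.choices(samples, weights=weights, k=m)

def residual_resample(samples, weights, m):
    """Residual resampling: floor(m * w_j) deterministic copies of sample j,
    then a multinomial draw on the leftover fractional weights."""
    total = sum(weights)
    wt = [w / total for w in weights]          # normalize to sum to one
    counts = [math.floor(m * w) for w in wt]
    out = [s for s, c in zip(samples, counts) for _ in range(c)]
    residual = [m * w - c for w, c in zip(wt, counts)]
    if len(out) < m:                           # m - k leftover draws
        out += random.choices(samples, weights=residual, k=m - len(out))
    return out

print(sir(["a", "b", "c"], [5, 3, 2], 10))
print(residual_resample(["a", "b", "c"], [5, 3, 2], 10))
```

With weights (5, 3, 2) and m = 10, the residual scheme is fully deterministic (the floors account for all ten draws), which illustrates how it removes multinomial noise whenever the weights are nearly multiples of 1/m.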
2.2 Markov Chain Monte Carlo
Markov chain Monte Carlo is among the most widely used methods for difficult approximate sampling and estimation problems. This is because it is often easy to design an ergodic Markov chain that holds a target distribution π as its stationary distribution, for a wide variety of distributions [4]; such a chain can then be used to draw approximate samples from π and to construct Monte Carlo estimators. Techniques for designing such Markov chains include the Metropolis-Hastings algorithm, data augmentation, the Gibbs sampler, and the hit-and-run algorithm. For further background on Markov chain Monte Carlo methods, refer to Liu [68], Asmussen and Glynn [6], and Robert and Casella [84].
2.2.1 Markov Chains
A sequence of random variables $X_{1:t}$ is a discrete-time Markov process if the distribution of $X_t$ given the most recent state $x_{t-1}$ is independent of all the previous states, or
$P(X_t \mid x_{1:t-1}) = P(X_t \mid x_{t-1})$.  (2.21)
This property of Markov chains is referred to as memorylessness. If the possible values for each $X_t$ form a countable space, then this type of process is known as a Markov chain. Markov chains are defined by a kernel $K(v, v')$ specifying the relative probability that $X_{t+1} = v'$ given that $x_t = v$. One way to think of a Markov chain is as a random walk on a weighted directed graph G(V, E) with non-negative edge weights. Possible states are represented by vertices, and the next state $x_{t+1}$ given the current $x_t$ is chosen with probability proportional to edge weights. The expected return time of a state v ∈ V is the expected number of steps for the Markov chain starting in state v to return to v. If a state has a finite expected return time, it is positive recurrent. The periodicity of a state v is the greatest common divisor of all possible return times; if the periodicity of v is 1, then v is said to be aperiodic. If all states v ∈ V are positive recurrent and aperiodic, then the Markov chain is said to be ergodic, and admits a unique stationary distribution π, such that
$K^k(v, v') \to \pi(v')$ as $k \to \infty$.
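This convergence can be checked numerically for a tiny chain; a sketch with an illustrative two-state kernel (not an example from the text):

```python
def step(dist, kernel):
    """Push a distribution row vector through the transition kernel once."""
    n = len(dist)
    return [sum(dist[i] * kernel[i][j] for i in range(n)) for j in range(n)]

# A small ergodic two-state chain (all entries positive, rows sum to one).
K = [[0.9, 0.1],
     [0.5, 0.5]]
dist = [1.0, 0.0]        # start deterministically in state 0
for _ in range(100):     # iterating: K^k(v, .) approaches pi for any start v
    dist = step(dist, K)
# The stationary distribution solves pi K = pi; here pi = (5/6, 1/6).
print(dist)
```

Starting instead from state 1 gives the same limit, which is exactly the statement that $K^k(v, v')$ loses its dependence on the starting state v.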
2.2.2 Metropolis Hastings
The Metropolis-Hastings algorithm [73, 46] was one of the first Markov chain Monte Carlo methods proposed and used in practice, and is still one of the most commonly used forms. Its popularity can be attributed to the simplicity of designing efficient mechanisms to approximately sample from an arbitrary distribution π. The two main requirements of the algorithm are that the relative probability $\pi(v')/\pi(v)$ of two states v, v′ under π can be computed, and that an ergodic proposal Markov chain on the state space with transition kernel K can be sampled from. No normalizing constants are required, and the algorithm works extremely well for many applications [27]. The procedure is as follows. Starting from a state v, take a step in the Markov chain K to state v′, and compute the acceptance ratio
$a = \frac{\pi(v')\, K(v', v)}{\pi(v)\, K(v, v')}$.  (2.22)
If a > 1, then move to state v′; otherwise, draw a uniform random variable u and move to state v′ if a > u, else stay at v. One potential difficulty with this algorithm is in the choice of proposal kernel K. Ideally, K should be chosen such that $K(v, v')$ is approximately $\pi(v')$, which can sometimes be difficult in practice. An inappropriate choice of K can cause the Metropolis chain to have an extremely low rate of convergence. Another potential issue is that the Metropolis chain may not be ergodic even if the proposal chain is. However, for many important problems these difficulties can be overcome, and Metropolis-Hastings has proved to be extremely useful in a wide variety of contexts; it has been named one of the most important algorithms of the 20th century [10].
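A minimal random-walk Metropolis sketch, targeting a density known only up to a constant (the Gaussian target and step scale are illustrative assumptions, not from the text); with a symmetric proposal, the K-ratio in (2.22) cancels:

```python
import math
import random

random.seed(3)

def metropolis(log_target, x0, steps, scale=1.0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    x, chain = x0, []
    for _ in range(steps):
        prop = x + random.gauss(0.0, scale)
        # Acceptance probability min(1, pi(prop)/pi(x)); K-terms cancel here.
        a = math.exp(min(0.0, log_target(prop) - log_target(x)))
        if random.random() < a:
            x = prop
        chain.append(x)
    return chain

# Target known only up to a constant: pi(x) proportional to exp(-x^2 / 2).
chain = metropolis(lambda x: -0.5 * x * x, 0.0, 50_000)
print(sum(chain) / len(chain))  # close to 0, the target mean
```

Note that only log-density differences appear, so no normalizing constant is ever computed, exactly as the text describes.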
2.2.3 Gibbs Sampler
The Gibbs sampler (introduced by Geman and Geman [39]; see Liu [68, chapter 6] for a good introduction) is another fundamental tool for designing ergodic Markov chains. Suppose that random variable X takes values in state space Ω with probability distribution π, and that each x ∈ Ω can be decomposed as x = {x1, . . . , xp}. The Gibbs sampler is as follows. Starting with an initial point x̃, cycle through coordinate indices j = 1, . . . , p, for each j sampling x̃j according to the conditional distribution
x̃j ∼ π(xj | x̃[−j]), (2.23)
where the notation x[−j] indicates taking all coordinate indices of x except for j, or
x[−j] := {x1, . . . , xj−1, xj+1, . . . , xp}. (2.24)
The method of cycling through coordinate indices is a design choice. If j is chosen uniformly at random at each step, the procedure is known as random-scan Gibbs sampling (summarized below in Algorithm 2); if instead one cycles through coordinate indices in a predetermined order, it is called systematic-scan Gibbs sampling.
Denote the Markov transition kernel for random-scan Gibbs sampling as KGibbs, with KGibbs(x, y) giving the probability density of the next state y conditioned on the current state x.

Algorithm 2 Random-scan Gibbs sampler.
  Start from an initial point x̃ ← {x̃1, . . . , x̃p}.
  while (not converged) do
    Pick an index j at random.
    Sample x̃j ∼ π(xj | x̃[−j]).
  end while

Under mild conditions the Gibbs sampler is positive recurrent and aperiodic (ergodic); since sampling from the conditional distribution (2.23) leaves π invariant, assuming ergodicity the Gibbs sampler admits π as its stationary distribution. One is not restricted to sampling only from the single-variable conditional distributions in (2.23). If one chooses to sample from the joint conditional distribution for multiple subindices j1, . . . , jp at once,
(x̃j1, . . . , x̃jp) ∼ π(xj1, . . . , xjp | x̃[−j1,...,−jp]), (2.25)
it is known as grouped Gibbs sampling. In grouped Gibbs, instead of sampling from the line defined by x̃[−j] as in standard Gibbs, one samples from the hyperplane defined by x̃[−j1,...,−jp] for some set of coordinate indices j1, . . . , jp. The set of coordinate indices may be chosen deterministically or according to a randomized scheme. It is not difficult to see that grouped Gibbs results in a faster mixing rate relative to standard Gibbs; see Liu [68] for details.
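Algorithm 2 can be sketched concretely for a target whose full conditionals are available in closed form (the bivariate normal with correlation ρ = 0.8 is an illustrative assumption, not an example from the text):

```python
import numpy as np

# Random-scan Gibbs (Algorithm 2) for a bivariate normal with correlation rho.
# Both full conditionals are closed-form:
#   X1 | X2 = x2  ~  N(rho * x2, 1 - rho^2),   and symmetrically for X2 | X1.
rho = 0.8
rng = np.random.default_rng(1)
x = np.zeros(2)                        # initial point x-tilde
samples = []
for _ in range(30000):
    j = rng.integers(2)                # pick a coordinate index j at random
    # sample x_j ~ pi(x_j | x_{[-j]})
    x[j] = rng.normal(rho * x[1 - j], np.sqrt(1.0 - rho ** 2))
    samples.append(x.copy())
samples = np.array(samples)
```

The empirical correlation of the chain converges to that of the target; a systematic scan would simply replace the random index draw with a deterministic cycle.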
2.2.4 Data Augmentation
The data augmentation algorithm, proposed by Tanner and Wong [96], can be thought of as a special case of the Gibbs sampler on a two-variable space X = {X1, X2}, where one is primarily interested in sampling from the first sub-variable X1; X2 is called the auxiliary variable. As in the standard Gibbs sampler, we draw samples of x1 according to π(x1|x2), then draw samples of x2 according to π(x2|x1), and repeat until satisfied. This yields joint samples x approximately distributed according to
π(x), and one may then “discard” the auxiliary variable to get samples x1 ∼ π(x1).
The “art” of data augmentation [99] is in the choice of auxiliary variable x2, which should ideally be chosen such that the conditional distributions π(x1|x2), π(x2|x1) are easily computed, and such that the resulting Markov chain is rapidly mixing.
2.2.5 Hit-and-Run
A generalized form of grouped Gibbs is known as the hit-and-run algorithm [3]: from the current state x, one randomly chooses a subset Lk ⊂ Ω with probability w(Lk), then samples the next state y according to the transition kernel Kk(x, y). The intuition behind hit-and-run is that it is not necessary to restrict oneself to sampling from hyperplanes of the state space; rather, one can choose arbitrary subsets L to sample from. In order to ensure convergence to the proper stationary distribution π, it is necessary to choose L, w, and K such that Kk(x, y) has stationary distribution proportional to wx(k)π(x). Note also that ergodicity of the hit-and-run Markov chain depends on the choices of L, K, and w.
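A classical concrete instance is uniform sampling from a convex body, where the subsets L are lines through the current point and the conditional of the uniform target on each chord is again uniform. The sketch below (the unit disk is an illustrative assumption, not an example from the text) picks a random direction, intersects the resulting line with the disk, and samples uniformly on the chord:

```python
import numpy as np

# Hit-and-run for the uniform distribution on the unit disk: the subsets L are
# lines through the current state.  Pick a uniform direction d, intersect the
# line {x + t d} with the disk, and draw the next state uniformly on that
# chord; the conditional of the uniform target on any chord is again uniform,
# so each move leaves pi invariant.
rng = np.random.default_rng(2)
x = np.zeros(2)
samples = []
for _ in range(20000):
    theta = rng.uniform(0.0, 2.0 * np.pi)
    d = np.array([np.cos(theta), np.sin(theta)])
    # chord endpoints: solve |x + t d|^2 = 1, i.e. t^2 + 2(x.d)t + (|x|^2 - 1) = 0
    b = x @ d
    disc = np.sqrt(b ** 2 - (x @ x - 1.0))
    t = rng.uniform(-b - disc, -b + disc)
    x = x + t * d
    samples.append(x.copy())
samples = np.array(samples)
```

For a non-uniform π one would instead sample t from the one-dimensional restriction of π to the chord, which is exactly the role played by the kernels Kk above.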
Hit-and-run requirements: choose L, w, and K such that, for each subset index k, Kk(x, y) has stationary distribution proportional to wx(k)π(x).

For each x ∈ Ω and coordinate index j, define subset indices kx,j such that kx,j = ky,j if and only if x[−j] = y[−j]. If one chooses subsets L and weights w such that

Lkx,j = {y : y[−j] = x[−j]}, (2.26)

wx̃(Lkx,j) = (1/n) 1{x = x̃}, and Kkx,j(x, y) proportional to π(y) 1{y ∈ Lkx,j}, then this is equivalent to single-variable random-scan Gibbs. This can be seen to satisfy the hit-and-run requirements, since for each x ∈ Lk, the probability of choosing Lk is 1/n. Hit-and-run also generalizes several other popular Markov chain Monte Carlo algorithms, including Swendsen-Wang, data augmentation, and slice sampling. See Andersen and Diaconis [3] for more details.

Chapter 3
Sequential Monte Carlo
In many important cases, one would like to analyze and develop models for data that are ordered or sequential in some natural way. This occurs for data with a time component, i.e. time-series data, but also in settings such as sampling from the space of self-avoiding walks [44, 86], contingency tables [23], and graphs with a prescribed degree sequence [14, 9]. For inference in these models, sequential Monte Carlo methods [31] have been developed. Sequential Monte Carlo methods include the class of algorithms known as particle filters, which were introduced in their current form by Gordon et al. [42] as the bootstrap filter. Particle filters are iterative, consisting of two basic elements: sequential importance sampling (SIS) and sampling importance resampling (SIR).
3.1 Sequential Models
First, some notation will be introduced. For each sample ω ∈ Ω, there is a function X : Ω → Ω^T, where Ω is the state space and T is a positive integer. Under this formulation, X is a discrete-time stochastic process. Typical examples include cases where the state space Ω is finite, N^n, or R^n, but more complicated examples can be considered as well. A sample x can be written as an array, X(ω) = {Xt(ω)}t∈1,...,T. As a shorthand, the ω is dropped and X1:T is referred to as a random variable with sub-indices X1, . . . , XT.
Figure 3.1: Dependence structure of hidden Markov models
In a sequential model, it will be assumed that evaluating and generating samples from the conditional probabilities P(xt|x1:t−1) for 1 ≤ t ≤ T is computationally feasible. In a hidden Markov model, X is a Markov process, and there is another coupled discrete-time stochastic process Y such that the conditional distribution of
Yt given xt is independent of the rest of the x and y variables, or
P(yt|x1:T , y1:t−1, yt+1:T ) = P(yt|xt). (3.1)
X is known as the hidden or latent process, and Y is called the observation process. A diagram showing the dependence structure of hidden Markov models is shown in Figure 3.1. Practical applications of hidden Markov models typically involve samples from the observation process Y, from which it is desired to make inferences about the latent process X. For example, one may wish to compute (and draw samples from) P(X|y), or to compute the expected value of functions of X conditioned on y. Another common problem is that the X and Y processes may be defined by some set of parameters θ for which we would like to find maximum likelihood estimates. In the case where the hidden process X is governed by an affine Gaussian process and the observation model for Yt conditioned on the hidden state xt is also affine Gaussian, one can use the Kalman filter [55] to efficiently make direct inferences about (and sample from) X conditioned on y. The Kalman filter is an iterative procedure that first computes E[Xt|xt−1], then updates based on the current observation yt to get E[Xt|xt−1, yt]; this is repeated for t = 1, . . . , T. Due to the Gaussian nature of the processes, knowing E[Xt|xt−1, yt] for each t is sufficient to compute and sample according to the full conditional distribution P[x1:T |y1:T ]. Estimates for the respective covariance matrices can also be determined iteratively. The efficiency of the Kalman filter makes it useful in real-world applications where Gaussian models are appropriate. In cases where the underlying hidden and observation processes are non-linear, versions of the Kalman filter such as the extended Kalman filter [51] or the unscented Kalman filter [54, 103] may be useful. However, these methods may not be applicable when the underlying state-space and observation processes are highly non-linear, or take values in non-Euclidean state spaces such as graphs or other combinatorial objects. In these cases, particle filtering (§3.3) offers an attractive alternative. See chapter 6 for an application of the Kalman and particle filters to multi-object tracking.
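The predict/update recursion can be sketched in one dimension (the AR(1) model, its parameters, and the function name below are illustrative assumptions, not taken from the text):

```python
import numpy as np

def kalman_filter(ys, a, q, r, m0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = a x_{t-1} + N(0, q),  y_t = x_t + N(0, r)."""
    m, p = m0, p0
    means = np.empty(len(ys))
    for t, y in enumerate(ys):
        m_pred, p_pred = a * m, a * a * p + q     # predict: E[X_t | y_{1:t-1}]
        k = p_pred / (p_pred + r)                 # Kalman gain
        m = m_pred + k * (y - m_pred)             # update:  E[X_t | y_{1:t}]
        p = (1.0 - k) * p_pred
        means[t] = m
    return means

# Simulate the model and filter the observations.
rng = np.random.default_rng(3)
a, q, r, T = 0.95, 0.1, 0.5, 500
xs = np.zeros(T)
for t in range(1, T):
    xs[t] = a * xs[t - 1] + rng.normal(0.0, np.sqrt(q))
ys = xs + rng.normal(0.0, np.sqrt(r), size=T)
means = kalman_filter(ys, a, q, r)
```

The filtered means track the hidden state with lower error than the raw observations, and the same recursion on p yields the iterative covariance estimates mentioned above.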
3.2 Sequential Importance Sampling
Suppose X and Y are the latent and observation processes of a hidden Markov model.
If it is known how to sample Xt according to the law
P(xt|y1:T, x1:t−1) = P(xt|yt:T, xt−1), (3.2)

where the equality is due to the independence properties of X and Y, one can sample directly from X|y sequentially. Note, however, that this distribution is conditioned on all future observations yt:T for each xt. For non-Gaussian processes, sampling from this optimal importance distribution is usually impractical, so instead one can sample according to some other distribution γ and use importance sampling. Denote the tth contribution to the sequential target distribution as πt(x) ≡ P(xt, yt|xt−1)/P(yt|y1:t−1), and the target distribution of the first t states given the first t observations as π1:t(x) ≡ P(x1:t|y1:t). π1:t can be built sequentially as follows:
π1:t(x) = P(yt|x1:t, y1:t−1) P(x1:t|y1:t−1) / P(yt|y1:t−1) (3.3)
        = [P(yt|xt) P(xt|xt−1) / P(yt|y1:t−1)] P(x1:t−1|y1:t−1) (3.4)
        = [P(xt, yt|xt−1) / P(yt|y1:t−1)] π1:t−1(x) (3.5)
        = πt(x) π1:t−1(x). (3.6)
A sequential importance distribution γ1:t(x) is chosen to be defined sequentially, such that there exist functions γt(x) with

γ1:t(x) = ∏_{s=1}^{t} γs(x). (3.7)
The tth sequential contribution to the importance weight is Wt(x) = πt(x)/γt(x), and the importance weight after t time-steps is

W1:t(x) = ∏_{s=1}^{t} Ws(x). (3.8)
Remarks:
• The denominator of πt, P(yt|y1:t−1), is independent of x and is not required for approximate sampling. Since it may be difficult to compute, the unnormalized value is often used,

π̃1:t(x) = P(xt, yt|xt−1) π̃1:t−1(x). (3.9)

π̃1:t(x) has P(y1:t) as its normalizing constant. One can similarly use an unnormalized importance distribution γ̃ and importance weight W̃.
• π̃t(x) ∝ P(xt|yt, xt−1) P(yt|xt−1), so one sensible possibility is to choose a sequential importance distribution γt(x) proportional to P(xt|yt, xt−1). Note that this is the optimal (zero variance) importance distribution for Xt given yt and xt−1. However, xt chosen in this manner is no longer optimal at time-step t + 1, since πt+1 depends on xt through P(yt+1|xt). For this reason P(xt|yt, xt−1) is called the locally optimal choice of importance distribution. The sequential contribution to the relative importance weights in this case is Wt(x) ∝ P(yt|xt−1).
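To preview the degeneracy issue taken up in §3.3: if the sequential contributions Ws in (3.8) are i.i.d., the product W1:t becomes heavy-tailed. A small self-contained demonstration (the i.i.d. lognormal contributions are an assumption made for the illustration, and the effective sample size formula ESS = (Σi Wi)²/Σi Wi² stands in for (2.19), which is not reproduced here):

```python
import numpy as np

# If each sequential contribution W_s is i.i.d. lognormal, the product W_{1:t}
# in (3.8) becomes increasingly heavy-tailed, and the effective sample size
# ESS = (sum_i W_i)^2 / sum_i W_i^2 collapses as t grows.
rng = np.random.default_rng(4)
N, T = 1000, 50
log_w = np.zeros(N)
ess = []
for t in range(T):
    log_w += rng.normal(0.0, 1.0, size=N)   # multiply in the t-th contribution
    w = np.exp(log_w - log_w.max())         # stabilized, unnormalized weights
    ess.append(w.sum() ** 2 / (w * w).sum())
```

After one step the effective sample size is a large fraction of N; after fifty multiplicative steps nearly all of the weight mass sits on a handful of samples.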
3.3 Particle Filter
One serious issue with sequential importance sampling is that the importance weights can become degenerate after a small number of steps, with most of the importance weight mass concentrated on a small subset of samples, even when using the locally optimal importance distribution. To address this issue, Gordon et al. [42] suggested the use of sampling importance resampling (SIR) in conjunction with sequential importance sampling (SIS) to create what is now referred to as sequential Monte Carlo. When the underlying model is hidden Markov, this is known as the particle filter. A standard reference for particle filtering is Doucet and De Freitas [31].
Algorithm 3 Particle filter approximate sampling.
  Initialize each x̃0^(i) ∼ π0 for i = 1, . . . , N.
  for t ∈ 1, . . . , T do
    Draw each x̃t^(i) ∼ γt(·|x̃t−1^(i)).
    Update sequential importance weights W1:t^(i) according to (3.8).
    Compute the empirical effective sample size ÊSS according to (2.19).
    If ÊSS < N̂, resample each x̃^(i) with probability proportional to W1:t^(i).
  end for
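A minimal bootstrap version of Algorithm 3 can be sketched as follows (the scalar linear-Gaussian model and all parameter values are illustrative assumptions, not from the text; the prior transition plays the role of γt, so the incremental weight is the observation likelihood, and resampling is triggered when the empirical ESS drops below N/2):

```python
import numpy as np

# Bootstrap particle filter for the illustrative model
#   x_t = a x_{t-1} + N(0, q),   y_t = x_t + N(0, r).
rng = np.random.default_rng(5)
T, N = 200, 500
a, q, r = 0.9, 0.3, 1.0
xs = np.zeros(T)
for t in range(1, T):
    xs[t] = a * xs[t - 1] + rng.normal(0.0, np.sqrt(q))
ys = xs + rng.normal(0.0, np.sqrt(r), size=T)

particles = rng.normal(0.0, 1.0, size=N)
logw = np.zeros(N)
est = []
for t in range(T):
    particles = a * particles + rng.normal(0.0, np.sqrt(q), size=N)  # SIS proposal
    logw += -0.5 * (ys[t] - particles) ** 2 / r    # weight by P(y_t | x_t)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    est.append(w @ particles)                      # filtered posterior mean
    if 1.0 / (w @ w) < N / 2:                      # empirical ESS below threshold
        particles = particles[rng.choice(N, size=N, p=w)]
        logw = np.zeros(N)
est = np.array(est)
```

The filtered posterior-mean estimates track the hidden state more closely than the raw observations, despite the model being simple enough that a Kalman filter would be exact.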
To see anecdotally why sequential importance sampling results in degenerate sample sets, consider the following case. Suppose that the sequential importance weight contribution Wt(X) is independently and identically distributed for each t; this would occur, for example, if γ is an affine Gaussian process. Then as t → ∞, by the central limit theorem, log W1:t(x), suitably centered and scaled, converges to N(0, 1). Therefore, in the limit W1:t is distributed according to a lognormal distribution, which is heavy-tailed. This implies that, generally speaking, we would expect a small number of samples to have very high importance weights relative to the other samples. Samples with low importance weights are also less likely to have high importance weights in the future. To address this issue, at each step one can perform sampling importance resampling. If SIR is applied at time-step t, this gives an approximate sample from π1:t(x). Note that for t < T this is not an approximate sample from the true target distribution π1:T(x1:t), which incorporates all future observations. Instead, the algorithm approximately samples from a sequence of intermediate distributions,
π1, π1:2, . . . , π1:T. In this manner, sequential Monte Carlo parallels simulated annealing, which also uses a sequence of approximate samples from intermediate distributions to sample from a target distribution (see §4.3.1 for more). Applying SIR has the effect of "pruning" the sample space, correcting previous "mistakes" and ensuring that the algorithm does not waste time on low-probability paths. Since each resampling adds multinomial noise and requires computational effort, it is desirable to apply resampling only when there is a clear benefit. For this reason, it is common practice to set some threshold N̂ and resample when the empirical effective sample size ÊSS falls below this threshold.

Chapter 4
Adaptive Importance Sampling
4.1 Background
Adaptive importance sampling (AIS) takes an indirect approach to building low variance estimators of E[X]. The goal of adaptive importance sampling is to find the best importance distribution g∗ within a restricted family of distributions G(·; v), with parameters v, whose likelihood functions are known or relatively easy to calculate. The idea is to iteratively build distributions within G based on a population of previously generated importance samples. A generic implementation of adaptive importance sampling is given in Algorithm 4. There are several possible choices for how to update the distribution parameter v based on the population of importance samples, including variance minimization [88] and the cross-entropy method [89]. As in the sequential setting, degeneracy is an important issue, and sampling importance resampling (see §2.1.2) may be used [20] in a similar manner as in the particle filter algorithm.
Algorithm 4 Adaptive importance sampling (AIS) algorithm
  Generate (x0^(i))1≤i≤N ∼ g0.
  Compute W0^(i) = f(x0^(i))/g0(x0^(i)).
  Generate (x̃0^(i))1≤i≤N by resampling (x0^(i))1≤i≤N based on W0^(i).
  k ← 0
  while (not converged) do
    k ← k + 1
    Update gk based on {x̃k−1^(i)}i≤N, restricted to G.
    Generate (xk^(i))1≤i≤N ∼ gk.
    Compute Wk^(i) = f(xk^(i))/gk(xk^(i)).
    Generate (x̃k^(i))1≤i≤N by resampling (xk^(i))1≤i≤N based on Wk^(i).
  end while
One of the primary challenges in designing an effective adaptive importance sampling algorithm is choosing the parametric family of distributions G. The choice of distribution family affects the quality of the estimator and the efficiency of the algorithm. See Rubinstein and Kroese [89] for examples of commonly used proposal distributions in a range of application settings. Ideally, the family of proposal distributions G should be flexible enough to specify a distribution that frequently visits "important" states. In other words, the best possible proposal distribution ĝ∗ ∈ G should be close to the optimal importance distribution g∗. On the other hand, the family of distributions G should be as simple as possible, i.e. have a relatively small parameter space and be easy to fit to data, both to avoid overfitting and so that ĝ∗ can be found with minimal computational effort. Intuitively, the family of distributions should contain a good "model" of the optimal distribution and easily incorporate some of the underlying structure of the problem.
4.1.1 Variance Minimization
A primary design issue in adaptive importance sampling algorithms is the choice of update rule for the importance distribution gk. Given a sample x = {x1, . . . , xN}, one natural rule would be to take as the next proposal distribution one that minimizes the sample variance of (2.6),
ĝ = argmin_g var_g[h(X) W(X)]. (4.1)
Since the expected value of h(X) f(X)/g(X) = h(X) W(X) is constant for any g with full support, this reduces to minimizing the second moment,
ĝ = argmin_g E_g[h²(X) W²(X)]. (4.2)
This is the idea underlying the "variance minimization" (VM) procedure: at step t, construct a new IS distribution gVM^(t) by choosing a sampling distribution that minimizes (4.2) over the samples (xt−1^(i))1≤i≤N. Often there is no analytic solution to (4.2), requiring numerical non-linear optimization at each step.
4.1.2 Cross-Entropy Method
Instead of minimizing the variance at each step, one can choose a distribution gCE that minimizes the Kullback-Leibler cross-entropy between gCE and the sample optimal importance distribution g∗(x). In the cross-entropy method [89], a sequence of distributions parameterized by v1, v2, . . . , vk is built iteratively. At each step, the Kullback-Leibler divergence (2.3) from g∗ to f, estimated from previously drawn samples, is minimized. One principal advantage of this method is that minimizing cross-entropy is often more computationally tractable than finding the minimum sample variance. Recall that the Kullback-Leibler divergence from g to f can be stated as
dKL(g‖f) = Eg[log g(X)] − Eg[log f(X)]. (4.3)
In cross-entropy adaptive importance sampling, the optimal parameter set v∗, minimizing dKL(g∗‖f(·; v)), is defined as
v∗ = argmax_v E_{g∗}[log f(X; v)], (4.4)
since the first term in (4.3) is constant with respect to v. One can rewrite this as

v∗ = argmax_v E_{g∗}[ (W(X)/W(X)) log f(X; v) ] (4.5)
   = argmax_v E_W[ (g∗(X)/W(X)) log f(X; v) ] (4.6)

for some reference distribution W(x). Since solving (4.6) directly is typically intractable, Rubinstein and Kroese [89] suggest an iterative procedure, using the distribution f(·; vt−1) as the reference distribution when solving for vt. One can then estimate the expectation in (4.6) via Monte Carlo simulation:
E_{vt−1}[ (g∗(X)/f(X; vt−1)) log f(X; v) ] ≈ (1/Cg∗) (1/N) Σ_{i=1}^{N} Wt−1(xi) log f(xi; v), (4.7)

where x1, . . . , xN ∼ f(·; vt−1), Wt−1(x) = |h(x)| P(G|H, x)/f(x; vt−1) is the importance weight of x under vt−1, and Cg∗ is the normalizing constant of g∗. Rubinstein and Kroese [89] suggest approximating f(·; v∗) at iteration t by maximizing (4.7) over v with respect to the empirical distribution of f(·; vt−1),
vt = argmax_v (1/N) Σ_{i=1}^{N} Wt−1(xi) log f(xi; v), with x1, . . . , xN ∼ f(·; vt−1). (4.8)
Note that for the purpose of computing vt, one can ignore the constant multiplier Cg∗. Maximizing the sum (4.8) is equivalent to finding the maximum likelihood estimator under v with each sample xi replicated Wt−1(xi) times. This leads to an interpretation of the cross-entropy method as iterative weighted maximum likelihood estimation [40]. For many families of distributions, computing the maximum likelihood estimator is fast, efficient, and well understood, so optimizing the sum (4.8) is often straightforward. For more on this connection between cross-entropy and maximum likelihood, see Asmussen and Glynn [6].
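As a concrete sketch of this weighted-MLE update, consider the textbook rare-event problem of estimating P(X > γ) for X ∼ Exp(1), whose true value is e^{−γ}. The exponential proposal family, the elite fraction ρ, and all parameter values below are illustrative assumptions, not taken from the text; for this family the update (4.8) is simply an importance-weighted mean of the elite samples:

```python
import numpy as np

# Cross-entropy method for P(X > gamma), X ~ Exp(1).  Proposals are Exp(mean v);
# the likelihood ratio is f(x)/g(x; v) = v * exp(-x (1 - 1/v)), and the
# weighted-MLE update of (4.8) for the exponential family is a weighted mean.
rng = np.random.default_rng(6)
gamma, N, rho = 20.0, 2000, 0.1
v = 1.0
for _ in range(15):
    x = rng.exponential(v, size=N)                # draw from g(.; v)
    w = v * np.exp(-x * (1.0 - 1.0 / v))          # importance weights f/g
    level = min(np.quantile(x, 1.0 - rho), gamma) # elite threshold
    elite = x >= level
    v = (w[elite] * x[elite]).sum() / w[elite].sum()  # weighted-MLE update
x = rng.exponential(v, size=N)
w = v * np.exp(-x * (1.0 - 1.0 / v))
est = np.mean((x > gamma) * w)                    # final importance estimate
```

The elite-sample updates drive v toward roughly γ + 1, close to the optimal exponential tilt, after which the final estimate of the 10⁻⁹-scale probability is typically accurate to within tens of percent at this sample size — a quantity naive Monte Carlo could not touch.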
4.2 Avoiding Degeneracy
Directly optimizing (4.8) to find a set of parameters vt may not always produce a distribution f(·; vt) that robustly generates "good" samples. Unless sample sizes are large, this estimator may over-fit the empirical data and lead to a degenerate approximating distribution f(·; vt), fit to a sample set in which the bulk of the importance weight is concentrated on one or a small number of samples. This situation can be diagnosed by a low effective sample size (see §2.1.2 for details). Rubinstein and Kroese [89] suggest several heuristic alterations to (4.8) that implicitly address the degeneracy issue. First, instead of optimizing (4.8) over all samples drawn from f(·; vt−1), one can use only the top ρ fraction of the samples, for some constant 0 < ρ < 1, and weight them all equally; this is known as the elite sample technique. Second, after computing the minimum sample cross-entropy, one can "smooth" it with the previous importance distribution according to a weight α (in distribution families where such parameter smoothing makes sense). Third, in the fully adaptive cross-entropy method, one adaptively adjusts the sample sizes drawn from f(·; vt) based on performance metrics. These adjustments work well for some types of problems, as demonstrated in the examples found in Rubinstein and Kroese [89]. For the complicated distributions on orderings encountered in §7, however, the basic cross-entropy method does not perform well, quickly leading to highly degenerate importance distributions. It also seems a worthwhile goal to fit the cross-entropy degeneracy problem within a larger information-theoretic or statistical learning context, where many tools for formally dealing with over-fitting have been developed. One robust criterion for the avoidance of over-fitting is the minimum description length (MDL) principle [43].
Minimum description length is a generalization of Occam's razor, the principle that one should use the simplest model that best fits the data. In practical terms, minimum description length gives an expression for the trade-off between model complexity and "goodness of fit". In addition to describing the data well (as measured by likelihood), one should compensate for over-fitting by taking into account the description length of the model, L(f(·; vt)). This is expressed as
L(f(·; vt), x) = L(f(·; vt)) − log(P(x|f(·; vt))) + const. (4.9)
See MacKay [70] chapter 28 for a very readable introduction to minimum description length.
One interpretation of the description length is to take L(f(·; vt)) to be the Shannon entropy of the distribution, i.e. L(f(·; vt)) = −Evt(log f(·; vt)). This is quite natural from an information-theoretic perspective, and directly penalizes distributions for deviating from the uniform distribution. A disadvantage of this formulation is that it may be difficult to estimate the entropy directly for complicated models. Another interpretation is to see L(f(·; vt)) as the log-likelihood of the parameters under some Bayesian prior. This may have the advantage of being easier to compute than the model entropy, but it may not be as interpretable as a measure of model complexity. We can use minimum description length principles when choosing a new distribution vt based on previously drawn samples. However, there is some ambiguity in how best to apply the technique to adaptive importance sampling. For instance, it is unclear how to properly weigh L(f(·; vt)) against log(P(x|f(·; vt))) in the objective function (4.9). In the standard model selection framework, when fitting to samples drawn from the "true" distribution of interest, no special weights are necessary: as the sample size increases, the log-likelihood term comes to dominate, allowing for more complicated models if they fit the data better. In the adaptive importance sampling setting, however, weighted samples drawn from the previous importance distributions are used only as a proxy for the optimal importance distribution g∗ when performing an update. Although many samples may be drawn, the effective sample size with respect to g∗ is often small, with the top few samples having importance weights orders of magnitude greater than the rest. In this iterative model selection framework, one should therefore weight the goodness-of-fit term based on the effective sample size.
In §7, several different heuristics are given for estimating this weight, such as gradually increasing the allowed entropy or assigning randomized weights to different samples when computing the effective sample size.
4.3 Related Methods
4.3.1 Annealed Importance Sampling
When one can sample directly from a reference distribution g with known normalizing constant Cg, a natural way to use importance sampling to estimate the normalizing constant Cf of a distribution πf (with unnormalized function f) is what Neal [76] refers to as the simple importance sampling estimator,
Cf/Cg = Eg[f(X)/g(X)] (4.10)
       ≈ (1/N) Σ_{i=1}^{N} f(xi)/g(xi), (4.11)

where (xi)1≤i≤N ∼ g. It is easy to see that (4.11) will have high variance if g and f are not close to one another, especially if g(x) is small where f(x) is large. One remedy for this is to use a sequence of distributions g = q0, q1, . . . , qn−1, qn = f and to chain the estimators together:
Cf/Cg = ∏_{j=0}^{n−1} Cj+1/Cj = ∏_{j=0}^{n−1} E_{qj}[ qj+1(Xj)/qj(Xj) ]. (4.12)
This is well known in computational physics as umbrella sampling. It has also been applied successfully in the approximate counting context, for applications such as counting the number of matchings in a graph [93] or estimating the volume of convex bodies [33]. The general idea is one of recursive estimation of size [2]; further examples can be found in Diaconis and Holmes [28]. One challenging issue associated with this family of techniques is that it may be very difficult or impossible to sample directly from an intermediate distribution of interest qj. The most straightforward technique is to use an ergodic Markov chain that holds qj as its stationary distribution to approximately sample according to qj. One can then draw correlated samples at each step of the Markov chain to build an estimator. It may be necessary to have a long "burn-in" at each intermediate step to reduce correlations and ensure convergence to stationarity [76]. The degree of correlation of the Markov chain samples will depend on the mixing rate; see §2.2 for more details.
If subsequent distributions qi and qi+1 are close enough to each other, each term in the product (4.12) will have low variance, and this will produce a good estimate of Cf. However, if the samples xj+1 are drawn using a Markov chain based on xj, this will introduce bias. As a remedy to this issue, Neal [75] introduced annealed importance sampling. The idea is to use a sequence of reversible transition kernels (Ti(x, y))1≤i≤n, where Ti(x, y) has stationary distribution πi, known up to a normalizing constant as qi. One can then produce a sample xn approximately distributed according to f, in a similar way as in simulated annealing [58], by starting at a state x0 drawn from g and applying the transition kernel Ti to xi−1 to generate xi for each 1 ≤ i ≤ n in sequence. One can then compute the importance weight of each sample xn^(j) as
W^(j) = ∏_{i=0}^{n−1} qi+1(xi)/qi(xi). (4.13)
Then an unbiased estimator of Cf/Cg is

Cf/Cg ≈ (1/N) Σ_{j=1}^{N} W^(j). (4.14)
Note that although annealed importance sampling produces an unbiased estimator, it may still have high variance depending on the sequence of transition kernels chosen (similar to the choice of annealing schedule in simulated annealing). The variance can also be reduced by taking a larger number of steps for each transition kernel.
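A compact sketch of this scheme on a toy problem (the problem itself is an illustrative assumption, not from the text): g is the unnormalized density e^{−x²/2}, f(x) = e^{−x²/(2s²)} with s = 3, so the true ratio Cf/Cg = s; the intermediate qi interpolate the log densities, and each particle takes one Metropolis step per temperature while accumulating the weight (4.13):

```python
import numpy as np

rng = np.random.default_rng(7)
s, N, n_temps = 3.0, 4000, 60
betas = np.linspace(0.0, 1.0, n_temps)

def log_q(x, b):                         # log q_beta (unnormalized), q_0 = g, q_1 = f
    return -0.5 * x ** 2 * ((1.0 - b) + b / s ** 2)

x = rng.normal(0.0, 1.0, size=N)         # exact draws from g = q_0
logw = np.zeros(N)
for b_prev, b in zip(betas[:-1], betas[1:]):
    logw += log_q(x, b) - log_q(x, b_prev)       # accumulate weight (4.13)
    y = x + rng.normal(0.0, 1.0, size=N)         # symmetric Metropolis proposal
    accept = np.log(rng.uniform(size=N)) < log_q(y, b) - log_q(x, b)
    x = np.where(accept, y, x)
ratio = np.exp(logw).mean()              # estimates C_f / C_g = s
```

Because the estimator is unbiased regardless of how well each kernel mixes, poor mixing shows up as variance rather than bias; more temperatures or more Metropolis steps per temperature reduce it.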
4.3.2 Population Monte Carlo
A generalized framework for the techniques in this chapter is given by Cappe et al. [20] under the name population Monte Carlo (PMC), summarized below in Algorithm 5. Instead of only allowing an importance distribution g taking values in a parametric family G based on the set of samples from the last iteration, population Monte Carlo allows more general update rules and can use samples from any of the previous time-steps. Many methods have been proposed for updating g based on a population of samples, such as using mixture-type models [30].
Algorithm 5 Generic population Monte Carlo (PMC) algorithm
  Generate (x0^(i))1≤i≤N ∼ g0.
  Compute W0^(i) = f(x0^(i))/g0(x0^(i)).
  Generate (x̃0^(i))1≤i≤N by resampling (x0^(i))1≤i≤N based on W0^(i).
  k ← 0
  while (not converged) do
    k ← k + 1
    Update gk based on {x̃j^(i)}j<k.
    Generate (xk^(i))1≤i≤N ∼ gk.
    Compute Wk^(i) = f(xk^(i))/gk(xk^(i)).
    Generate (x̃k^(i))1≤i≤N by resampling (xk^(i))1≤i≤N based on Wk^(i).
  end while

Chapter 5

Conditional Sampling Importance Resampling

In this chapter, conditional sampling importance resampling (CSIR), an extension of sampling importance resampling (SIR), is considered (see §2.1.2 for background on SIR). Conditional SIR applies to cases where it is difficult to draw directly from a target conditional distribution π, but it is possible to compute conditional distributions π(xi|xj) for coordinate indices i, j up to a constant factor. This setting occurs naturally in many sequential importance sampling contexts.

5.1 Motivation

In many application settings, it is not feasible to generate draws from the conditional distribution (2.23), but one would still like to apply the Gibbs sampler. For this purpose, Koch [60] proposed Gibbs SIR, which at each iteration performs a SIR step based on conditional draws from an importance distribution γ. The algorithm is as follows. Given an initial point x̃, for coordinate index j, draw N conditional samples xj^(1), . . . , xj^(N) according to γ(xj|x̃[−j]), then compute importance weights Wj^(i) = π(xj^(i)|x̃[−j])/γ(xj^(i)|x̃[−j]), and draw x̃j from {xj^(i)}i=1..N with probability proportional to Wj^(i). Due to the point-wise convergence of SIR (under mild conditions on γ and π), the conditional probabilities converge to (2.23) as N → ∞, so it is easy to see that as R, N → ∞ this process converges to π, where R is the number of Gibbs steps. The motivating application in Koch [60] is image reconstruction.

Algorithm 6 SIR Gibbs
  Start with arbitrary x̃.
  while (not converged) do
    Randomly choose coordinate index j.
    Draw N iid samples {xj^(i)}i=1..N according to γ(xj|x̃[−j]).
    Compute importance weights Wj^(i) = π(xj^(i)|x̃[−j])/γ(xj^(i)|x̃[−j]).
    Sample x̃j from {xj^(i)}i=1..N with probability proportional to Wj^(i).
  end while

A primary difficulty with the Gibbs SIR procedure is that it may be computationally expensive to draw samples from the conditional importance distribution γ(x|x̃[−j]). This is particularly the case in sequential Monte Carlo contexts, where importance distributions may be complicated and built sequentially over many time-steps. Below, a procedure is detailed that builds approximate conditional samples using only samples from the joint importance distribution; it is referred to as conditional sampling importance resampling (CSIR), and a simple example is given for the case of the multivariate normal distribution. Section §6 develops the motivating application to multi-object tracking.

5.2 Conditional Resampling

The goal in conditional SIR, as in standard SIR, is to approximately draw samples of a target distribution π using samples drawn from an importance distribution γ, both taking values in state space Ω. It is assumed that γ has full support with respect to π, i.e. π(x) > 0 ⇒ γ(x) > 0. Suppose that elements x ∈ Ω can be partitioned as x = {x1, . . . , xn}, and that one has access to the importance and target conditional distribution functions, π(xj|x[−j]) and γ(xj|x[−j]), up to a constant factor, but that these distributions are difficult to sample from directly. Suppose that one draws a collection of iid samples {x^(i)}i=1,...,N from γ. One can decompose these as a collection of samples from the marginal distributions, {x1^(i)}i=1..N, . . . , {xn^(i)}i=1..N.

The idea of CSIR is then to use these marginal samples as importance samples for the purpose of approximately sampling from the conditional distributions. One can think of this as a step of an approximate Gibbs Markov transition kernel. A step of CSIR Gibbs is as follows.
Starting from x̃ ∈ Ω, pick a coordinate index j at random, compute importance weights W_j^{(i)} = π(x_j^{(i)} | x̃_{[−j]}) / γ(x_j^{(i)}), and draw x̃_j from {x_j^{(i)}}_{i=1}^N with probability proportional to W_j^{(i)}. This process is summarized in Algorithm 7.

Algorithm 7 CSIR Gibbs
  Draw N independent importance samples {x^{(i)}}_{i=1}^N ∼ γ.
  Start with arbitrary x̃.
  while (not converged) do
    Pick an index j at random.
    Compute W_j^{(i)} = π(x_j^{(i)} | x̃_{[−j]}) / γ(x_j^{(i)}) for i = 1, ..., N.
    Draw x̃_j from {x_j^{(i)}}_{i=1}^N with probability proportional to W_j^{(i)}.
  end while

Theorem 1. If the Gibbs sampler is ergodic for π, and for each x ∈ Ω and index j, π(x_j | x_{[−j]}) / γ(x_j | x_{[−j]}) < ∞, then as N → ∞, CSIR Gibbs is ergodic and has stationary distribution π.

Proof. For each sample/coordinate pair x and j, the CSIR Gibbs transition kernel K_CSIR converges to the Gibbs transition kernel K due to the convergence of SIR.

One of the main advantages of CSIR is that it can exploit independence structure to greatly improve the quality of the approximate sample relative to SIR. This can yield lower-variance estimators than the standard importance sampling estimate μ̂_IS, unlike SIR. To properly take advantage of independence structure, Chapter 6 develops the notion of grouping subsets for CSIR, similar to the grouped Gibbs sampler or the L_k sets in the hit-and-run algorithm of §2.2.5. A disadvantage of CSIR is that extra computational effort may be needed to approximate conditionals.

5.2.1 Estimating Marginal Importance Weights

CSIR requires computation of the marginal importance distributions γ(x_j), which for complicated importance distributions is often impractical to do exactly. There are several ways to approximate them. One is to take the one-off conditional distribution of the sample, γ(x_j | x_{[−j]}). Since γ(x_j) = E[γ(x_j | X_{[−j]})], this is an unbiased estimator; in cases where X_j is independent of X_{[−j]}, it is exact.
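As a concrete illustration, the sketch below runs a CSIR Gibbs chain (Algorithm 7) on a toy problem: the target π is a bivariate normal with correlation ρ, the importance distribution γ is a standard bivariate normal, and, because γ has independent coordinates, the marginal γ(x_j) is known exactly. The correlation value, sample size, and step count here are illustrative choices, not values taken from the thesis.

```python
import math
import random

random.seed(0)

RHO = 0.8     # target correlation (illustrative choice)
N = 2000      # importance sample size
STEPS = 400   # CSIR Gibbs coordinate updates

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Draw N iid importance samples from gamma = N(0, I_2).
samples = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]

def csir_gibbs_step(x_cur, j):
    """One CSIR Gibbs update of coordinate j (Algorithm 7): reuse the j-th
    marginal of the stored importance samples, reweighting each candidate by
    pi(x_j | x_{-j}) / gamma(x_j)."""
    other = x_cur[1 - j]
    # Target conditional: X_j | X_{-j} = x is N(rho * x, 1 - rho^2);
    # gamma's coordinates are independent, so gamma(x_j) = N(0, 1) exactly.
    weights = [normal_pdf(s[j], RHO * other, 1 - RHO ** 2) /
               normal_pdf(s[j], 0.0, 1.0) for s in samples]
    new_coord = random.choices([s[j] for s in samples], weights=weights, k=1)[0]
    return (new_coord, x_cur[1]) if j == 0 else (x_cur[0], new_coord)

x = samples[0]
draws = []
for step in range(STEPS):
    x = csir_gibbs_step(x, step % 2)
    draws.append(x)

# The chain should recover the target's positive correlation.
mean_xy = sum(a * b for a, b in draws) / len(draws)
print(round(mean_xy, 2))
```

Each update resamples only one coordinate from the stored marginal sample, so no new draws from γ are needed after the initial batch, which is the point of the method.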
Another method is to use the Monte Carlo estimator

    γ(x_j) ≈ (1/N) Σ_{i=1}^N γ(x_j | x_{[−j]}^{(i)}).

Instead of using all N indices, one can also use a random subset. Empirically, the Monte Carlo estimate appears to perform well, since x_j has been generated by the joint importance distribution γ and is not a rare occurrence. If coordinate index j exhibits a higher degree of independence, this Monte Carlo estimate will be more accurate.

5.2.2 Conditional Effective Sample Size

As in the standard SIR context, degeneracy of importance weights can occur in conditional resampling, with some sub-samples having much higher conditional importance weights than others for a given coordinate index j. To measure degeneracy, some heuristics extending the notion of effective sample size are introduced here. The marginal effective sample size (MESS) is defined as

    MESS_j = N E[W̃(X_j)]² / E[W̃(X_j)²],    (5.1)

where W̃ denotes the unnormalized importance weight; one can similarly define an empirical version. As explained above, one cannot generally compute the marginal distributions directly, so one can attempt to approximate MESS using conditional distributions. When using the same-sample conditional importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}^{(i)}) / γ(x_j^{(i)} | x_{[−j]}^{(i)}), this is referred to as the local conditional effective sample size (local CESS),

    lCESS = ( Σ_{i=1}^N W_j^{(i)} )² / Σ_{i=1}^N ( W_j^{(i)} )².    (5.2)

Alternatively, one can use the cross-sample conditional importance weights W_j^{(i,k)} = π(x_j^{(i)} | x_{[−j]}^{(k)}) / γ(x_j^{(i)} | x_{[−j]}^{(k)}) to compute what will be called the global conditional effective sample size,

    gCESS = ( Σ_{i=1}^N Σ_{k=1}^N W_j^{(i,k)} )² / Σ_{i=1}^N Σ_{k=1}^N ( W_j^{(i,k)} )².    (5.3)

As when estimating the marginal distributions for sampling purposes, a subset of indices can be used instead of the full N when computing these Monte Carlo estimates.
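The two CESS heuristics can be computed directly from the conditional weight ratios. The sketch below does so for an assumed bivariate normal target with correlation ρ against a standard normal importance distribution (the setting and parameter values are illustrative, not from the thesis):

```python
import math
import random

random.seed(1)

N = 300      # importance sample size (illustrative)
RHO = 0.8    # target correlation (illustrative)

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Importance samples from gamma = N(0, I_2); under the target pi,
# X_0 | X_1 = x_1 is N(rho * x_1, 1 - rho^2).
samples = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]

def cond_weight(xj, x_other):
    # pi(x_j | x_{-j}) / gamma(x_j | x_{-j}); gamma's coordinates are
    # independent, so its conditional equals its N(0, 1) marginal.
    return normal_pdf(xj, RHO * x_other, 1 - RHO ** 2) / normal_pdf(xj, 0.0, 1.0)

def ess(weights):
    # Effective sample size of an unnormalized weight vector.
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Local CESS (5.2): same-sample weights W_j^(i), one per sample.
lcess = ess([cond_weight(x0, x1) for x0, x1 in samples])

# Global CESS (5.3): cross-sample weights W_j^(i,k) over all N^2 pairs.
gcess = ess([cond_weight(samples[i][0], samples[k][1])
             for i in range(N) for k in range(N)])

print(lcess, gcess)
```

Note that lCESS lies in [1, N] while gCESS, being computed over N² weight pairs, lies in [1, N²]; in practice a random subset of pairs would replace the full double sum.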
Note that lCESS and gCESS are simply heuristics used to estimate sample quality, and are not directly related to the convergence of CSIR.

5.2.3 Importance Weight Accounting

Recall that a sample x has importance weight

    W(x) = π(x) / γ(x) = π(x_j | x_{[−j]}) π(x_{[−j]}) / ( γ(x_j | x_{[−j]}) γ(x_{[−j]}) ).    (5.4)

With CSIR, x̃_j is chosen approximately according to π(· | x̃_{[−j]}); therefore W(x̃′) ≈ π(x̃_{[−j]}) / γ(x̃_{[−j]}) = W(x̃) / W_j(x̃_j). Essentially, the algorithm is "resetting" the conditional weights for the resampled part. One can therefore perform a single step of CSIR for a sample x̃ and update its importance weight accordingly. However, since this is an approximation, it is generally not advisable to do so over several steps of CSIR, since errors will cascade. A better policy is to start x as a SIR sample, which initially resets all importance weights; one can then assume that importance weights remain the same for all samples.

5.3 Example: Multivariate Normal

The following example considers a target distribution π that is multivariate Normal with mean 0 and covariance C, using a spherical multivariate Normal importance distribution γ. Because one can sample directly from the target as well as from its conditional and marginal distributions, this allows easy evaluation of the performance of CSIR as compared to SIR. The importance random variable, denoted Y, is multivariate Normal with mean 0 and covariance α²I, and admits an importance density γ such that, for x ∈ R^n,

    γ(x) = (2π)^{−n/2} α^{−n} exp(−xᵀx / 2α²).

The target random variable is denoted X and is multivariate Normal with mean 0 and covariance C.
X admits a density π such that, for x ∈ R^n,

    π(x) = (2π)^{−n/2} |C|^{−1/2} exp(−xᵀC⁻¹x / 2).

For these choices of importance and target distributions, one can explicitly compute the conditional distributions. Recall that if X is decomposed as

    X = (X_a, X_b),    (5.5)

where X_a and X_b are column vectors of size q and n − q respectively, and C is partitioned conformably into blocks C_11, C_12, C_21, C_22, then the conditional distribution of X_a given X_b is q-dimensional multivariate normal with mean

    μ = C_12 C_22⁻¹ X_b

and covariance matrix

    Σ = C_11 − C_12 C_22⁻¹ C_21.

In the examples below, N samples {y^{(i)}}_{i=1,...,N} are drawn from the importance distribution and their importance weights are computed. To build the SIR sample {x̃_SIR^{(i)}}_{i=1,...,N}, N samples are drawn with replacement from the importance samples with probability proportional to their importance weights. To build the CSIR samples {x̃_CSIR^{(i)}}_{i=1,...,N}, each is initialized as a SIR sample, i.e. x̃_CSIR^{(i)} = x̃_SIR^{(i)}, and the CSIR Gibbs Markov step is then run, updating each coordinate a total of k times.

To test the quality of the approximate samples, a Monte Carlo estimate of the Kullback-Leibler divergence is computed. If γ and π are normal densities with respective means μ_γ, μ_π and covariance matrices Σ_γ, Σ_π, the Kullback-Leibler divergence can be written explicitly as

    D_KL(γ, π) = ½ [ tr(Σ_π⁻¹ Σ_γ) + (μ_π − μ_γ)ᵀ Σ_π⁻¹ (μ_π − μ_γ) + log(det Σ_π / det Σ_γ) − n ].    (5.6)

However, due to the multinomial resampling of SIR and CSIR, the distribution g̃ of an approximate sample x̃ cannot be expressed as a multivariate Normal. Instead, the KL-divergence between g̃ and π is approximated. To estimate the cross-entropy term E[log π(X̃)]: since samples have been drawn from g̃ and it is feasible to evaluate log π(x̃), one can use the crude Monte Carlo estimator (1/N) Σ_{i=1}^N log π(x̃^{(i)}). To estimate the entropy term E[log g̃(X̃)], crude Monte Carlo cannot be used, since in neither SIR nor CSIR is there an explicit representation of g̃.
Instead, a density estimate ĝ is built. There are many possible ways to build a density estimate, but in this case, since both the importance and target distributions are multivariate normal, ĝ is taken to be multivariate normal and the maximum likelihood fit is used. Crude Monte Carlo with ĝ is then used to estimate the entropy term, (1/N) Σ_{i=1}^N log ĝ(x̃^{(i)}).

For the examples below, three cases are chosen:

1. C_id = αI.

2. C_unif, a random matrix with eigenvalues chosen uniformly at random in (0, 1].

3. C_skew, a random matrix with eigenvalues chosen by a lognormal (μ = 0, σ = 1.5) in (0, 1], scaled to have maximum eigenvalue 1.

Figure 5.1: CSIR Normal example: eigenvalues of covariance matrices scaled to have max eigenvalue 1. Panels: (a) C_id, (b) C_unif, (c) C_skew.

The random matrices above were generated by taking QᵀΣQ, where Q is a random orthogonal matrix. For the case of the uncorrelated target distribution, C = αI with 0 < α < 1, the Kullback-Leibler divergence from the importance distribution γ to the target distribution π is

    D_KL(γ, π) = ½ ( n/α + n log α − n )    (5.7)
               = (n/2) ( 1/α + log α − 1 ),    (5.8)

so D_KL(γ, π) is linear in the dimension n. Using the iid properties of the importance and target distributions, one can see that for the conditional SIR approximate sample,

    D_KL(g_CSIR, π) = E[ log( g_CSIR(X̃_CSIR) / π(X̃_CSIR) ) ]    (5.9)
                    = Σ_{i=1}^n E[ log( g_CSIR(X̃_CSIR,i) / π(X̃_CSIR,i) ) ]    (5.10)
                    = n E[ log( g_CSIR(X̃_CSIR,1) / π(X̃_CSIR,1) ) ],    (5.11)

so for a given sample size N, D_KL(g_CSIR, π) also increases linearly in the dimension n. For the random covariance matrices, C is formed by pre- and post-multiplying the diagonal eigenvalue matrix by a random orthogonal matrix Q, obtained from a QR-decomposition of a matrix of n draws from a standard multivariate normal in n dimensions.

Figure 5.2: Estimated KL-divergence versus dimension n for approximate samples using CSIR and SIR, N = 50 samples, k = 5 passes of the Gibbs sampler. Panels: (a) C_id, (b) C_unif, (c) C_skew.

I repeat this process for 20 different randomly chosen covariance matrices for the uniform and skewed cases. Dotted lines are 95% confidence intervals based on 1000 bootstrap samples. Note that in these experiments, CSIR has much lower KL-divergence than SIR in every example, particularly for the random covariance matrices; both methods perform worst on the skewed covariance matrix. Obviously, in these examples one can easily sample directly from the target distribution, but they illustrate how CSIR can give a large improvement over SIR for high-dimensional state spaces. The next chapter uses this technique in a sequential importance sampling setting where the state space is high-dimensional.

Figure 5.3: Same experiments as in Figure 5.2, plotted by method: (a) SIR, (b) CSIR.

Chapter 6

Multi-Object Particle Tracking

This chapter applies conditional resampling in the particle filter context, with an application to multi-object tracking (MOT).
The novelty in this approach is that it allows the use of arbitrary jointly distributed forward movement and observation models, while maintaining asymptotic convergence properties and computational efficiency. To make the particle filter computationally feasible under this joint model, the use of conditional sampling importance resampling for sequential Monte Carlo is introduced. This modified particle filter tracking algorithm can handle unknown or varying numbers of objects, as well as the problem of associating observations with objects, without making parametric assumptions on the nature of the forward model or resorting to ad-hoc steps.

6.1 Background

6.1.1 Single Object Tracking

The task of inferring a single object's path from a sequence of (possibly noisy) observations is known as tracking. One way to formulate tracking is as an inference problem in a hidden Markov model (HMM). Recall that in a hidden Markov model, it is assumed that there is some underlying "hidden" Markov process X_{1:T} of interest which cannot be observed directly, together with a sample from an "observation" process Y_{1:T}. X_{1:T} is also called the state-space process. The simplest non-trivial hidden Markov model for tracking is a Gaussian state-space process X_{1:T} coupled with a Gaussian observation process Y_{1:T}, with X_t, Y_t ∈ R^n. This can be stated as

    X_t = X_{t−1} + N(0, Σ_s)    (6.1)
    Y_t = X_t + N(0, Σ_o).    (6.2)

In this case, one can use the Kalman filter [55] to solve the inference problem numerically. The Kalman filter is a sequential procedure that first computes E[X_t | X_{t−1}], then updates based on the current observation y_t to get E[X_t | X_{t−1}, y_t]; this is iterated over t = 1, ..., T. Due to the Gaussian nature of the processes, knowing E[X_t | X_{t−1}, y_t] for each t is sufficient to determine the full conditional distribution P[X_{1:t} | Y_{1:t}]. See Algorithm 8 for a step of the Kalman filter for this simple example.
Algorithm 8 Kalman filter step for simple Gaussian process
  P_t⁻ = P_{t−1} + Σ_s
  K_t = P_t⁻ (P_t⁻ + Σ_o)⁻¹
  x_t = x_{t−1} + K_t (y_t − x_{t−1})
  P_t = (I − K_t) P_t⁻

An alternative to the Kalman filter is particle filter tracking (or particle tracking). The modeling assumptions required to use the particle filter are quite general compared to those of the Kalman filter. The advantage of the particle filter is that more realistic, non-linear, and discontinuous observation and movement models may be specified; the disadvantage is an increase in computational complexity and implementation difficulty, usually requiring Monte Carlo simulation instead of fast and relatively simple deterministic computations.

Inference in tracking problems can often be posed in terms of integrals or expected values. While deterministic, quadrature-type methods exist for particle filters, in general the integrals to be computed are high-dimensional, in which case deterministic methods tend to be intractable. For this reason, it is common to resort to Monte Carlo methods for particle filtering (Algorithm 9), although some attempts at deterministic particle filtering have been made (see for example Doucet and De Freitas [31], ch. 5).

Algorithm 9 Single object particle filter tracking
  Draw initial particles x_0^{(1)}, ..., x_0^{(N)} from γ_0.
  for t in 1, ..., T do
    for each particle index i do
      Sample x_t^{(i)−} ∼ γ_t(· | x_{t−1}^{(i)}).
      Compute importance weight w_t^{(i)}.
    end for
    for each particle index i do
      Sample with replacement x_t^{(i)} from x_t^{(1)−}, ..., x_t^{(N)−} with probability proportional to w_t^{(i)}.
    end for
  end for

6.1.2 Multi Object Tracking

Multi-object tracking is a large and diverse field with many applications across science and engineering, and it is closely related to the single-object tracking problem. As noted in Vermaak et al. [100], most multi-object tracking algorithms fall into two categories.
The first approach is to run multiple copies of individual single-object trackers, then post-process the outputs of the single-object trackers to handle occlusion and/or confusion. Interactive effects between objects and observations are assumed to be negligible or are dealt with heuristically. These algorithms tend to be fast and often work well in practice, but if interactive effects are non-negligible, significant bias may be introduced, and convergence to the optimal solution is not guaranteed as the number of samples goes to infinity. Hue et al. [48] give an early example of this approach, with an algorithm for multi-object tracking that couples an 'association vector' approach with parallel particle filters. This model assumes independence in the observation and movement models, and uses a simulation-wide indicator test to determine whether to add or remove particles at a given time-step.

A second approach is to formally treat the multi-object tracking problem as an instance of single-object tracking by enlarging the state space to represent all object paths jointly as a single object. The process and observation models for this enlarged object can then be arbitrarily jointly distributed across all individual objects and observations, and computational techniques developed for single-object tracking, such as particle filtering, can be applied to make inferences in this model. In this vein, Khan et al. [56] give an algorithm for multi-object tracking that uses a Markov random field to define a joint movement distribution for interacting objects.

The drawback of the enlarged-state approach is that SIR becomes inefficient in high dimensions. To see why naive SIR is generally ineffective in high dimensions, consider the example of n objects moving independently with no interaction.
Parallel particle filters tracking each object individually would preferentially draw the most likely current states for each object. A naive joint-state particle filter implementation would select the best joint state-space particles, but any single joint particle is unlikely to contain all of the best individual current states. This line of reasoning shows that the naive joint-state particle filter would require a number of samples exponential in the dimension to achieve error comparable to parallel single-object particle filters, which is impractical for even a modest number of tracked objects. To compensate for this effect, the algorithm described in this chapter applies conditional resampling to take advantage of the large degree of independence between groups of objects that are not in close proximity to one another.

Other useful frameworks for multi-object tracking include the mixture particle filter [100] and the PHD filter [101]. For tutorials on general particle filter methods, see Chen [24] and Doucet and De Freitas [31]. An overview of particle-based approaches to multi-object tracking is given in the thesis of Ozkan [80].

6.1.3 Tracking Notation

A time-step index t is a sequential coordinate index in Z+. An observation φ is simply a position in R^n occurring at a time-step t. An observation set Y_t is a set of observations occurring at time-step t. Objects are assigned an index l, and object states are represented as ζ_{l,t}. A trajectory ζ_{l,1:t} is a sequence of states representing the movements of a single object, also called a path. Note that trajectories may contain leading and trailing sequences of null states ∅, indicating that the object entered and/or left the field of view at some point in time. An observation event ψ is an intermediate variable representing the process by which a set of observations was generated by a set of objects.
The object state set X_t at time t is a set containing the states of all objects at time-step t, while the event state Ψ_t is the set of all object events ψ at time t. The enhanced state at time t is the pair (X_t, Ψ_t) and is denoted X̂_t. A particle x_{1:t} is a set of trajectories representing the full joint evolution of the state space under the particle filter algorithm up to time-step t. Each particle has associated with it an importance weight W_{1:t}. Note that after a SIR step, x_{1:t} is an approximate sample from the target distribution at time t and its importance weight is reset to 1. State-space, event, and observation process samples are denoted in lower case as x, ψ, y respectively, and the corresponding random variables in upper case as X, Ψ, Y. In this chapter it is assumed that there are N particle samples from the state space, with i used as a sample index. The total number of time steps observed is denoted T.

6.2 Conditional SIR Particle Tracking

In multi-object tracking, there are various discrete 'decisions' that a tracking algorithm must make. These decisions correspond to discrete events in the forward simulation model, such as the deletion/creation of objects or occlusion, as well as the 'association' problem of assigning individual observations to objects. Addressing the association problem is the primary contribution of algorithms such as JPDAF [22]; this method and others deal with association via Monte Carlo. Dealing with unknown or changing numbers of objects is often addressed via ad-hoc methods, such as defining a 'decision rule' in which objects are added/removed across all particles simultaneously [48], or by assuming that the forward simulation follows some parametric form such as a mixture model [100].

In the 'enlarged state-space' particle filter, both the deletion/creation and association problems are handled in an asymptotically correct way, but as previously noted, this approach yields poor computational complexity. Conditional SIR can be used to improve performance. To facilitate this, grouping subset functions are defined to decompose the state space into two parts. A grouping subset function takes as input a particle and returns a grouping subset, which can be used to apply the grouped Gibbs or hit-and-run algorithms as described in §2.2.5. A key algorithmic design issue in CSIR is choosing appropriate grouping subset functions.

6.2.1 Grouping Subsets for Multi-Object Tracking

In the multi-object tracking context, one way to define grouping subset functions is to return sets of trajectories associated with individual observations φ ∈ y_t. G_φ(x_{1:t}) denotes the set of trajectories that are associated by events with observation φ for particle x up to time t. If there are no trajectories associated with φ in x_t, i.e. φ was considered a false positive in x_t, then the empty set ∅ is returned. As noted above, trajectories ζ_{1:t} may include leading and trailing null states ∅, corresponding to the object entering and leaving the field of view. A simplified example of grouping based on observations is shown in Figure 6.1.

Given a grouping subset G, it is possible to sample approximately from π(X_G | X_{[−G]}) using conditional SIR. If G is chosen independently of X, then this defines a Markov transition that leaves the stationary distribution invariant. Note that in the case of well-separated trajectories, applying conditional SIR to enlarged state-space particles is equivalent to running a separate particle filter for each object independently.
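The dimension argument above can be illustrated numerically by comparing the effective sample size of joint-state weights against coordinate-wise weights. The setting below (n independent objects, each with a per-coordinate importance weight given by an assumed N(0.5, 1) target against an N(0, 1) proposal) is an illustrative sketch, not the tracking model of this chapter:

```python
import math
import random

random.seed(2)

N = 1000   # number of particles
n = 20     # number of independent objects (dimensions)

def w1(x):
    # Per-coordinate importance weight: target N(0.5, 1) against an
    # N(0, 1) proposal (normalizing constants cancel in the ratio).
    return math.exp(-(x - 0.5) ** 2 / 2) / math.exp(-x ** 2 / 2)

def ess(weights):
    # Effective sample size of an unnormalized weight vector.
    return sum(weights) ** 2 / sum(w * w for w in weights)

particles = [[random.gauss(0, 1) for _ in range(n)] for _ in range(N)]

# Naive joint-state weights: product of per-object weights. The weight
# variance grows exponentially in n, so the joint ESS collapses.
joint_ess = ess([math.prod(w1(xj) for xj in p) for p in particles])

# Coordinate-wise (CSIR-style) weights for a single object stay healthy.
coord_ess = ess([w1(p[0]) for p in particles])

print(round(joint_ess, 1), round(coord_ess, 1))
```

With these illustrative parameters the joint ESS is a tiny fraction of N while the per-coordinate ESS remains close to N, which is exactly the gap that grouping-based conditional resampling exploits.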
6.2.1.1 Choosing/Pruning Grouping Functions

To choose the groupings to which conditional SIR is applied, a table of groupings is kept, with the empirical CESS for each grouping, and is updated after each iteration. There is much potential redundancy in groupings, and updating the table is expensive, so this grouping table is typically 'pruned'. In general, it is possible to use arbitrary rules about which groupings to use, based on aggregate particle sample properties, without affecting the asymptotic convergence properties. Pruning decision rules include only considering groupings that originated a constant number of time-steps in the past, not considering 'children' of well-established groups in subsequent time-steps, and removing 'eliminated' groups as determined by heuristics such as CESS. As noted, a wide range of pruning algorithms can be used.

Figure 6.1: Grouping subset functions based on observations, denoted by enclosed dashed lines. Filled circles: observations. x's, o's: individual objects from particles 1 and 2 respectively. Note that in the grouping on the right, the particles differ in the number of associated object trajectories.

6.3 Application: Tracking Harvester Ants

In this section, the CSIR approach is applied to tracking the movements of individual ants in a colony of harvester ants in a laboratory setting.

6.3.1 Object Detection

A key step in multi-object tracking is object recognition, i.e. identifying what constitutes an object and what does not. To facilitate this task, the tracking algorithm uses sophisticated object detection software known as GemIdent [47]. Each pixel is classified independently as belonging to one of the specified object types (or none). The algorithm then forms a graph representation of the classified pixels and uses spectral graph partitioning [91] to find clusters of pixels corresponding to individual objects.

6.3.1.1 Pixel Classification

Supervised learning refers to the practice of building a learning algorithm with the aid of a set of training points. Interactive supervised learning is a technique that recursively applies supervised learning to predict a response variable by asking the user for input, training the algorithm on the new input, reporting the results of the training to the user, and repeating until the user is satisfied. This interactivity streamlines the training process, allowing the user to initially identify a small number of example points, then add further points to correct mistakes made in the classification process. This minimizes the total number of training points the user must provide and shortens the total time involved in the training process.

As input to the supervised learning algorithm, it is necessary to provide precomputed features for each pixel to classify. The primary feature used here is a ring score. The algorithm first normalizes pixel values using the Mahalanobis distance based on a pre-defined color set. To compute the ring score, take the average normalized value relative to color c of all pixels at radius r from the pixel of interest. The algorithm computes this value for each color c and each radius r < R, for some pre-defined maximum R. To incorporate movement information into pixel classification, the ring scores of the same pixel in adjacent frames, both forward and backward in time, are also included.

Once features have been computed for each pixel, random forests [18] are used to classify each pixel based on those features. Random forests are used because of their relative simplicity, and for the current application they empirically give comparable results to other techniques such as support vector machines (SVMs).
Random forests also have the advantage of being easily interpretable, as it is possible to directly compute the relative importance of each feature in the classification process. This approach was compared to other machine learning techniques in the publicly available software library WEKA [37]. For a book-length treatment of statistical and machine learning techniques, see Hastie et al. [45].

6.3.1.2 Centroid Finding

After pixels are classified, it is necessary to determine the number and locations of ants. The algorithm uses a flood-fill algorithm to group adjacent pixels of the same type into "blobs". Small, disparate blobs occur due to incorrectly classified pixels. In addition, objects that are clustered together in an image tend to produce large connected blobs. The current application needs to determine which blobs are noise and which indicate the presence of ants, and how many. To do this, the algorithm forms a graph of the blob by connecting adjacent pixels, then uses spectral graph partitioning [91] to cut the blob. A good overview of spectral methods for graphs can be found in Spielman [95]. One way to think about spectral partitioning is as an approximation to the 'sparsest cut' problem, which seeks a cut that maximizes the number of vertices separated while minimizing the number of edges cut. The parameters of this centroid-finding step must be tuned. Figure 6.2 shows two blobs and the results of spectral partitioning.

6.3.2 Observation Model

While functioning relatively well in most cases, the GemIdent machine-learning algorithm is prone to error due to the presence of background clutter, ants moving in close proximity to one another, unusual orientations of ants as they traverse objects, and other factors, some of which confuse even human observers.
Additionally, there are tradeoffs between the quality of the images, the interactive training time of the interactive supervised learning algorithm, computational efficiency, and the accuracy of the classifications. It is desirable to have a tracking algorithm that can operate robustly in the presence of this relatively noisy process. Of particular concern is the propensity of the algorithm to split an individual ant into two observations or to merge two ants into a single observation.

Figure 6.2: Blob bisection via spectral partitioning.

Under the observation model, each Y_t is generated as the result of events that occur according to a probability distribution dependent on X_t. In the current application, there are two broad classes of object events: independent events and joint events. Independent Normal observation events correspond to the canonical single-trajectory/single-observation case, with an observation occurring at a location according to a bivariate Normal distribution centered at the object location, with standard deviation σ_o. Independent false negative events indicate that no observation was recorded for an individual object; this occurs for each object according to a Bernoulli random variable with parameter λ_n. Independent false positive events indicate a spurious observation not directly caused by any proximate object; these occur according to a uniform 2D Poisson process with parameter λ_p. Splitting is a special kind of false positive in which a pair of observations appears at the front and back of an object, due to overzealous segmentation; this occurs for each object according to a Bernoulli random variable with parameter λ_s. Finally, two adjacent objects have a probability of 'merging' with their neighbors as a function of their distance, with an observation occurring near the center of the merged objects. Merging occurs in the joint model between two objects at distance d with probability α_m exp(−d⁴/λ_m). Individual objects may be involved in multiple mergings at the same time.

Table 6.1: Observation event types.

  Event type                    Description                                              Parameters
  Independent observations      Single trajectory with a single observation              σ_o
  Independent false negatives   No observation recorded for an individual object         λ_n
  Independent false positives   Spurious observation not directly related to an object   λ_p
  Splitting                     Single object yields a pair of observations              λ_s
  Merging                       May occur when objects are within close proximity        λ_m, α_m

6.3.3 State-Space Model

In the current application it is assumed that individual objects move according to independent affine Gaussian stochastic processes, with drift vectors depending on an estimate of the object's velocity. For this model, the object state at time t can be decomposed as ζ_t = {u_t, v_t}, where u, v ∈ R² are the position and velocity vectors respectively, and the model written as

    u_t = u_{t−1} + v_{t−1} + N(0, σ_m² I)    (6.3)
    v_t = α(u_t − u_{t−1}) + (1 − α) v_{t−1},    (6.4)

where σ_m is the movement dispersion parameter. The possibility that individual objects can enter or leave the field of view is modeled according to a birth-death process with rate parameters λ_b, λ_d (both birth and death rates uniform with respect to space and time). This is a relatively simple movement model, but for the current application it appears to suffice, as most of the complications and non-linearities come from the observation process.

6.3.4 Importance Distribution

A critical consideration is the choice of importance distribution. As discussed in §3.2, the optimal importance distribution γ* samples x_t based on the previous object state and all current and future observations, i.e.
    γ*_t(x) = π(x_t | x_{t−1}, y_{t:T})    (6.5)

Since drawing from γ* is infeasible, the second best method practically available is often to use the locally optimal importance distribution,

    γ_t(x) = π(x_t | x_{t−1}, y_t).    (6.6)

In the current context it is generally impossible to sample from this importance distribution directly, so instead the algorithm uses Markov chain Monte Carlo, in particular the data augmentation algorithm of §2.2.4. To sample approximately from γt, the event set Ψt is used as an auxiliary variable for the data augmentation algorithm. The goal in this case is to draw a joint sample according to π(xt, Ψt | xt−1, yt), even though the primary interest is in xt. The algorithm does this by first generating a draw of Ψt conditioned on xt, then xt conditioned on Ψt, or

    Ψ_t ∼ π(Ψ_t | x_t, x_{t−1}, y_t)    (6.7)
    x_t ∼ π(x_t | Ψ_t, x_{t−1}, y_t),    (6.8)

and repeating. As a starting point, an initial value of xt is drawn from one forward step of the object state-space model, i.e. xt is drawn according to π(xt | xt−1).

Algorithm 10 Sampling π(xt, Ψt | yt, xt−1) via data augmentation.
  Start with projected trajectory positions xt.
  repeat
    Sample π(Ψt | xt, yt) via Algorithm 12.
    Sample π(xt | Ψt, yt, xt−1) as described in §6.3.4.1.
  until converged

6.3.4.1 Sampling from State-Space Given Events

Sampling from π(xt | Ψt, xt−1, yt) is decomposed by event types. Note that at this step, resampling xt means resampling the set of object positions, as object birth/death movements are considered when sampling from events. The algorithm cycles through each element of Ψt and performs actions based on the event type.

Independent normal observation  Both the movement and observation processes are Gaussian, so the object location can be modeled as bivariate Normal. This is due to the following relationship.
    π(x_t | x_{t−1}, y_t) = π(y_t | x_t) π(x_t | x_{t−1}) / π(y_t | x_{t−1})    (6.9)
                          ∝ π(y_t | x_t) π(x_t | x_{t−1})    (6.10)

When the event type is independent Gaussian movement and observation, both π(yt | xt) and π(xt | xt−1) are Normal, and the product of the distributions will also be Normal. Since independence in the covariance structure of the 2D coordinate axes is assumed for both the observation and movement models, the distribution π(ζt | ζt−1, φt) can be written as the product of the distributions for each coordinate axis. The coordinate-wise standard deviation and mean of the object position ut will then be

    σ_{u_t} = sqrt( σ_m² σ_o² / (σ_m² + σ_o²) )    (6.11)
    µ_{u_t} = ( (u_{t−1} + v_{t−1}) σ_o² + y_t σ_m² ) / (σ_m² + σ_o²),    (6.12)

and ut can be expressed as a N(µ_{u_t}, σ_{u_t}² I) random variable.

Independent false negative  In this case there is no observation information, so the object state is updated according to the forward movement model (6.3).

Independent false positive  Independent false positives are not related to any objects, so no object states need to be updated.

Splitting  Splitting is represented as an independent normal observation, using the average of the two points as the observation center.

Merging/Splitting  Given a series of mergings and splittings Ψ̂, associated subsets of observations ŷt, and a subset of object points x̂t, it is possible to compute the probability of the event set occurring as π(Ψ̂, ŷt | x̂t, xt−1). In order to sample from π(x̂t | Ψ̂, ŷt), the algorithm samples from a Metropolis-Hastings independence chain with proposal density Q(x̂t) = π(x̂t | x̂t−1). The acceptance probability of a new state x̂t′ will then be

    a = π(Ψ̂, ŷ_t | x̂_t′, x_{t−1}) / π(Ψ̂, ŷ_t | x̂_t, x_{t−1})    (6.13)

6.3.4.2 Sampling from Events Given State-Space

The second half of the data augmentation algorithm for sampling from the importance distribution (6.6) is to sample the events Ψt given the state-space positions xt.
    γ(Ψ_t | x_t) = π(Ψ_t | x_t, y_t)    (6.14)

To sample from γ(Ψt | xt) a Metropolis-Hastings Markov chain is used. This entails drawing samples from a proposal chain κ and accepting/rejecting them based on their likelihoods relative to the current state.

The underlying Markov chain used in the current approach is based on sampling from object/observation associations. The global version is as follows. Every object is (independently) associated with an observation with probability given by a decreasing function of their distance, f(d(ζ, φ)). It is then possible to directly evaluate the likelihood of this association. The algorithm samples from the target distribution using a Metropolized independence chain. This global proposal distribution is referred to as κ.

Figure 6.3: Association of objects with observations. 'Events' correspond to connected components in this bipartite graph, including Normal observations, splitting, merging, false positives, false negatives, and joint events.

For an edge set E, one can express κ as

    κ(E) = ∏_ζ [ ∏_{(ζ,φ)∈E} f(d(ζ,φ)) · ∏_{(ζ,φ)∉E} (1 − f(d(ζ,φ))) ]    (6.15)

f should be chosen to be close to the probability of an observation with position φ being generated in an event involving an object at location ζ: if d(ζ, φ) < σo an association is almost certain, if d(ζ, φ) < 2σo it is quite likely, and if d(ζ, φ) > 3σo it is improbable. The algorithm assumes this takes a parametric form such as

    f(d) = α exp(−d^p / β),    (6.16)

where p and β are chosen based on σo, and 0 < α ≤ 1 is related to the false positive and merging probabilities. An example would be to take α = .95, p = 4, and β = −(3σo)^p / log .01, which gives the plot below.
[Plot: association probability f(d) as a function of d/σo, decreasing smoothly from 1.0 near d = 0 to approximately 0.01 at d = 3σo.]

There are several local versions. One is to pick an object at random, resample its associations, and accept according to the Metropolis rule; this is an instance of "single-particle Metropolis". Another variation is to resample the associations for a subset of objects. This subset could be random, or could be taken as the k nearest neighboring objects to a randomly chosen object or observation, or all objects within distance d of a randomly chosen object or observation; k or d could be randomly chosen or set constant. This procedure will be ergodic if choices are centered around objects (nonzero probability for every bipartite graph), and ergodic for choices centered around observations if k or d are large enough that every object is close enough to an observation to be resampled at some point.

Refer to the current edge set as E and the proposal edge set as E′. The Metropolis acceptance value a is

    a = [γ(E′) / γ(E)] · [K(E′, E) / K(E, E′)]    (6.17)
      = [γ(E′) / γ(E)] · [κ(E) / κ(E′)]    (6.18)

The ratio can be directly computed as

    κ(E) / κ(E′) = ∏_{(ζ,φ)∈E, ∉E′} f(d(ζ,φ)) / (1 − f(d(ζ,φ))) · ∏_{(ζ,φ)∈E′, ∉E} (1 − f(d(ζ,φ))) / f(d(ζ,φ))    (6.19)

and similarly for the ratio

    γ(E′) / γ(E).    (6.20)

Algorithm 11 Sampling proposal event set Ψ via association distribution κ.
  Form bipartite association probability graph A(xt, yt) using association probability function f.
  Draw edges E independently with probability proportional to edge weights.
  Find connected components of E, forming event set Ψ.
  Return Ψ.
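Algorithm 11 can be sketched concretely. The snippet below is a minimal illustration, not the thesis implementation: it uses the example parameterization of (6.16) with α = .95, p = 4, and β = −(3σo)^p / log .01, and a small union-find to read off connected components; the function names `f_assoc` and `propose_event_set` are hypothetical.

```python
import math
import random

def f_assoc(d, sigma_o, alpha=0.95, p=4):
    """Association probability f(d) = alpha * exp(-d^p / beta), (6.16),
    with the example choice beta = -(3*sigma_o)^p / log(.01), which makes
    f(3*sigma_o) = 0.01 * alpha."""
    beta = -(3.0 * sigma_o) ** p / math.log(0.01)
    return alpha * math.exp(-(d ** p) / beta)

def propose_event_set(objects, observations, sigma_o, rng=None):
    """Algorithm 11 sketch: draw edges independently with probability f(d),
    then return the connected components of the bipartite graph as events."""
    rng = rng or random.Random(0)
    nodes = [('obj', i) for i in range(len(objects))] + \
            [('obs', j) for j in range(len(observations))]
    parent = {v: v for v in nodes}          # union-find over the bipartite graph

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    edges = []
    for i, (ox, oy) in enumerate(objects):
        for j, (px, py) in enumerate(observations):
            d = math.hypot(ox - px, oy - py)
            if rng.random() < f_assoc(d, sigma_o):
                edges.append((i, j))
                parent[find(('obj', i))] = find(('obs', j))

    components = {}
    for v in nodes:
        components.setdefault(find(v), []).append(v)
    return edges, list(components.values())
```

Because f decays like exp(−d⁴), pairs beyond a few σo essentially never link, which is what makes the association graph effectively sparse.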
Each edge set E defines an event set Ψ via the connected components of the association graph. However, this relationship is not injective: the same event set Ψ may result from multiple different edge sets, since connected components for joint events may be formed in multiple ways. Then

    κ(Ψ) = κ(E) / κ(E | Ψ)    (6.21)
         = ∑_{E′ ↦ Ψ} κ(E′)    (6.22)
         = E[κ(E) 1_{E ↦ Ψ}]    (6.23)

This can be decomposed by event components. For each ψ ∈ Ψ, define E_ψ to be the associated set of edges generated by κ for the objects and observations associated with ψ. Then κ(Ψ) can be decomposed as the probability that there are no edges between objects and observations not in the same events, multiplied by the probability of connectivity within event components:

    κ(Ψ) = κ({E \ ∪_{ψ∈Ψ} E_ψ} = ∅) · ∏_{ψ∈Ψ} κ(E_ψ ↦ ψ)    (6.24)

To compute the first term on the RHS of (6.24), the probability that there are no cross-component edges, simply compute

    κ({E \ ∪_{ψ∈Ψ} E_ψ} = ∅) = ∏_{(ζ,φ)∉Ψ} (1 − κ((ζ, φ))),    (6.25)

i.e. the product of one minus the probability of each non-component edge occurring. Since the association probability graph is effectively sparse (the probability of an edge between an object and an observation is effectively 0 if their distance exceeds 4σo), this probability can be computed quickly. To compute the individual component probabilities κ(E_ψ ↦ ψ), for small numbers of objects plus observations it is feasible to perform the calculations directly through enumeration. For large joint distributions, the algorithm estimates this probability through Monte Carlo estimation of

    κ(E_ψ ↦ ψ) = E[κ(E_ψ) 1_{E_ψ ↦ ψ}]    (6.26)

To compute the probability of an event set Ψt under γ, note that

    γ(Ψ_t | x_t) = π(Ψ_t | x_t, y_t)    (6.27)
                 = π(Ψ_t, x_t, y_t) / π(x_t, y_t)    (6.28)
                 = π(y_t | x_t, Ψ_t) π(Ψ_t | x_t) π(x_t) / π(x_t, y_t)    (6.29)
                 ∝ π(y_t | x_t, Ψ_t) π(Ψ_t | x_t)    (6.30)

Both of these probability distributions, π(yt | xt, Ψt) and π(Ψt | xt), are defined by the forward observation model.
The first term can be decomposed as the product of the individual event probabilities. This can be written as

    π(y_t | x_t, Ψ_t) = ∏_{ψ∈Ψ_t} π(y_{t,ψ} | x_{t,ψ}, ψ),    (6.31)

where y_{t,ψ} and x_{t,ψ} are the subsets of yt and xt associated with event ψ.

For the second term on the RHS of (6.30), π(Ψt | xt), recall that the 'event' random variable corresponds to a connected component in the association graph but does not carry any positional information. So this is the probability that the set of trajectories x_{t,ψ} produced a (joint) event, times the probability that these events produced the correct number of observations |y_{t,ψ}| given the object associations. The joint event-set probability can be written as

    π(Ψ_t | x_t) = π(Ψ_{y_t}, Ψ_{x_t} | x_t)    (6.32)
                 = π(Ψ_{y_t} | Ψ_{x_t}, x_t) π(Ψ_{x_t} | x_t)    (6.33)

To compute π(Ψ_{x_t} | xt), the 'object association probability' for the joint forward model is used to compute the probability of each component being connected times the probability of there being no cross-component edges, or

    π(Ψ_{x_t} | x_t) = ( ∏_{ψ∈Ψ} π(ψ | x_{t,ψ}) ) · ∏_{(φ,φ′)∉Ψ_x} (1 − π((φ, φ′) | x_t))    (6.34)

Combining these yields

    γ(Ψ | x_t) ∝ ( ∏_{ψ∈Ψ} π(y_{t,ψ}, ψ | x_{t,ψ}) ) · ∏_{(φ,φ′)∉Ψ_x} (1 − π((φ, φ′) | x_t)),    (6.35)

where π(y_{t,ψ}, ψ | x_{t,ψ}) can be broken down as

    π(y_{t,ψ}, ψ | x_{t,ψ}) = π(y_{t,ψ} | x_{t,ψ}, ψ) π(ψ_y | ψ_x, x_{t,ψ}) π(ψ_x | x_{t,ψ})    (6.36)

Algorithm 12 Metropolis sampling of π(Ψt | xt, yt).
  Start with event set Ψ.
  repeat
    Ψ′ ← Ψ.
    Pick a random subset of objects S ⊂ xt, using local or global criteria.
    Remove events ψ ∈ Ψ′ associated with S.
    Randomly sample new events for S according to κ, and add them to Ψ′.
    Compute acceptance probability a(Ψ, Ψ′) via (6.18).
    With probability min(1, a), set Ψ ← Ψ′.
  until converged

6.3.5 Computing Relative and Marginal Importance Weights

Another main element of the importance sampling algorithm is the computation of importance weights.
It is of interest to compute both the relative importance weights for an entire particle, and the marginal importance weights for grouping subsets.

6.3.5.1 Computing Particle Relative Importance Weights

The importance weight contribution from time t will be

    W_t(x) = π_t(x) / γ_t(x).    (6.37)

Since γt(x) ∝ P(xt | xt−1, yt), the contribution will be proportional to

    w̃_t(x) = P(y_t | x_{t−1}).    (6.38)

This can be computed via Monte Carlo as

    P(y_t | x_{t−1}) = E_{X_t | x_{t−1}}[P(y_t | X_t)],    (6.39)

where xt is drawn from P(xt | xt−1). However, this estimator may have high variance, since for any given xt it may be unlikely to generate a given yt. From the sampling step there are approximate samples from P(xt, Ψt | xt−1, yt), which is in fact the optimal importance distribution for estimating (6.39). However, this probability is only known up to a normalizing constant (the normalizing constant is in fact P(yt | xt−1)). One way of computing this normalizing constant is by observing the following:

    P(y_t | x_{t−1}) = P(y_t | x_t, Ψ_t) P(x_t, Ψ_t | x_{t−1}) / P(x_t, Ψ_t | y_t, x_{t−1})    (6.40)

    P(x_t, Ψ_t | y_t, x_{t−1}) / P(y_t | x_t, Ψ_t) = P(x_t, Ψ_t | x_{t−1}) / P(y_t | x_{t−1})    (6.41)

Integrating with respect to xt, Ψt gives

    1 / P(y_t | x_{t−1}) = E_{X_t, Ψ_t | y_t, x_{t−1}}[ 1 / P(y_t | X_t, Ψ_t) ]    (6.42)

    P(y_t | x_{t−1}) = ( E_{X_t, Ψ_t | y_t, x_{t−1}}[ P(y_t | X_t, Ψ_t)^{−1} ] )^{−1}.    (6.43)

Also note that computing P(yt | xt, Ψt) is straightforward: it is just the product of the observation probabilities for each event ψ ∈ Ψt, which are computed anyway when generating the joint samples. This means a Monte Carlo estimate based on E_{X_t, Ψ_t | y_t, x_{t−1}}[P(yt | Xt, Ψt)^{−1}] can be used to estimate P(yt | xt−1), and this estimate can be built concurrently while building the joint sample Xt, Ψt | yt, xt−1.
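The identity underlying (6.42)–(6.43), that the marginal likelihood equals the inverse of the posterior expectation of the inverse likelihood, can be verified exactly on a toy discrete model where every quantity is computable in closed form. All numbers below are arbitrary illustration values, not parameters from the tracking model.

```python
# Exact check of the identity (6.43) on a toy two-state model:
# P(y) = ( E_{x ~ P(x|y)}[ 1 / P(y|x) ] )^{-1}
prior = {0: 0.3, 1: 0.7}                 # P(x), arbitrary values
lik = {0: 0.2, 1: 0.6}                   # P(y|x) for one fixed observation y

p_y = sum(lik[x] * prior[x] for x in prior)           # direct marginal: 0.48
post = {x: lik[x] * prior[x] / p_y for x in prior}    # posterior P(x|y)
inv_mean = sum(post[x] * (1.0 / lik[x]) for x in prior)
p_y_harmonic = 1.0 / inv_mean                          # the target of (6.43)

assert abs(p_y - p_y_harmonic) < 1e-12
```

In the tracking algorithm the posterior expectation is not available exactly and is replaced by a Monte Carlo average over the draws of (Xt, Ψt) already produced by the data augmentation step.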
6.3.5.2 Computing Marginal Importance Weights

In order to perform resampling, it is necessary to estimate the marginal importance weights,

    W_j^{(i)} = π(x_j^{(i)} | x̃_{[−j]}) / γ(x_j^{(i)})    (6.44)

This computation depends on the choice of "coordinate index" j, which corresponds to a partition of the state space as discussed in §2.2.5. To choose this partition, the extended state-space (xt, Ψt) that includes the event set Ψt will be used. There are several different possible ways to choose partitions, as discussed in §6.2.1. The main purpose of conditional resampling is to allow for 'revisions' across multiple time-steps by 'grafting' sub-trajectories from one particle onto another. This is necessary because events represent realizations of 'decision points', and decisions that are plausible at time-step t may be much less plausible compared to alternatives when taking into account future observations.

As previously noted, there are many possible grouping functions to which conditional resampling may be applied. In the current application, trajectory groupings are formed based on well-separated observations. Recall that it is possible to choose grouping functions arbitrarily based on yt without introducing bias. The goal in this case is to form groups of individual trajectories based on single observations. The algorithm then keeps this grouping when/if the set of trajectories enters a joint merging/splitting setting, and decides whether to discard the grouping based on aggregate properties of the group. To resample across the grouping, positions from one trajectory are 'grafted' onto another using the last k time-steps, the resulting marginal importance weights are estimated, and resampling is performed based on these weights.

Algorithm 13 Conditional sampling importance resampling for particles.
  Determine which grouping sets to resample, forming new groupings at each observation and pruning old groupings using criteria such as age and CESS.
  Resample particles using SIR to get x̃_t^(1), ..., x̃_t^(N).
  for each grouping j do
    for i in 1, ..., N do
      Compute marginal importance weights using (6.44).
      Choose a grouping k with probability proportional to the marginal importance weights.
      Graft grouping G_j(x̃_t^(k)) onto x̃^(i) as described in §6.3.5.2.
    end for
  end for

Algorithm 14 Multi-object particle tracking for Harvester ants.
  Initialize particles x_1^(1), ..., x_1^(N) based on the prior distribution/first observations y1.
  Compute initial importance weights w_1^(1), ..., w_1^(N).
  for t in 2, ..., T do
    Advance each particle by sampling (x_t^(i), Ψ_t^(i)) ∼ γt via Algorithm 10.
    Estimate importance weights for each particle via (6.43) using samples from the previous step.
    Apply conditional SIR to construct new particles via Algorithm 13.
  end for

6.4 Empirical Results

6.4.1 Simulated Data

For the previous example, the "true" hidden state of the objects is unknown, as the example comes from real-world data. In this example, samples from the hidden state-space are simulated using the Gaussian forward model (6.3), and observations are drawn according to the observation model (§6.3.2). Choosing samples from this known, tractable model allows us to directly evaluate results from the particle tracking algorithm.

The simulated data starts with 100 objects distributed uniformly at random in a 500 by 500 grid. The simulation then proceeds by moving the objects according to the forward model, with movement dispersion parameter σ_m² = 2. The observation model is simulated with observation dispersion parameter σ_o² = 1.5, split probability parameter λs = .006 for each object at each time-step, and with a merging probability of λm = .45 (decreasing with distance).
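Generating such simulated data amounts to iterating the forward model (6.3)–(6.4). The sketch below is illustrative rather than the thesis code: the velocity-smoothing value α = 0.5 is a hypothetical choice (the text does not fix it for the simulation), and σm is set to √2 so that σ_m² = 2 as above.

```python
import random

def simulate_forward(n_objects=100, n_steps=50, size=500.0,
                     sigma_m=2 ** 0.5, alpha=0.5, seed=1):
    """Simulate the movement model (6.3)-(6.4):
      u_t = u_{t-1} + v_{t-1} + N(0, sigma_m^2 I)
      v_t = alpha*(u_t - u_{t-1}) + (1 - alpha)*v_{t-1}
    sigma_m = sqrt(2) corresponds to sigma_m^2 = 2; alpha = 0.5 is a
    hypothetical smoothing value, not one fixed by the text."""
    rng = random.Random(seed)
    pos = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_objects)]
    vel = [(0.0, 0.0)] * n_objects
    tracks = [pos]                          # tracks[t][i] = position of object i
    for _ in range(n_steps - 1):
        new_pos, new_vel = [], []
        for (ux, uy), (vx, vy) in zip(pos, vel):
            nx = ux + vx + rng.gauss(0.0, sigma_m)   # (6.3), per coordinate
            ny = uy + vy + rng.gauss(0.0, sigma_m)
            new_pos.append((nx, ny))
            new_vel.append((alpha * (nx - ux) + (1 - alpha) * vx,   # (6.4)
                            alpha * (ny - uy) + (1 - alpha) * vy))
        pos, vel = new_pos, new_vel
        tracks.append(pos)
    return tracks
```

Observations would then be drawn from these positions via the event model of §6.3.2 (Normal jitter, false negatives/positives, splits, merges), which is omitted here for brevity.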
The results for one state-space sample, with path lengths and objects per frame, are summarized in Figure 6.4, and "observed" samples with the number of observations per frame are summarized in Figure 6.5. We then ran this example using particle tracking, first using a sample from the importance distribution (Figure 6.6), then using CSIR (Figure 6.7). As in the previous example, CSIR produces much longer path lengths, closely matching the "true" number of paths that last the entire simulation.

Figure 6.4: "True" distribution of path lengths and trajectories per frame, simulated example.

Figure 6.5: Centroid observations per frame, simulated example.

Figure 6.6: Distribution of path lengths and trajectories per frame using a sample from the importance distribution, simulated example.

6.4.2 Short Harvester Ant Video

As a first example of the tracking algorithm, we used GemVident to find centroids for a video file that has 753 frames. A screenshot of the centroid finding results for a single frame from GemVident is shown in Figure 6.8. The centroid finding algorithm for this video resulted in an average of 109 observations per frame, with the distribution of observations shown in Figure 6.9.

For this example, we first drew a sample from the importance distribution and inspected the quality of the sample. A video file of a draw from the importance distribution can be found at http://www.stanford.edu/~guetz/particleTracking/noCSIR.mp4. The blue dots represent trackings; the red dots are the centroids from GemVident. Note that the trackings tend to follow the observations for a short number of time-steps before getting lost. One can see this directly by looking at a histogram of the trajectory lengths over the sample (Figure 6.10).

We then applied the CSIR particle filter to this example using 20 particles. A video file of the CSIR tracking can be found at http://www.stanford.edu/~guetz/
particleTracking/CSIR.mp4.

Figure 6.7: Distribution of path lengths and trajectories per frame using CSIR, simulated example.

The CSIR result is much better. Looking at the trajectory lengths (Figure 6.11), note that a large percentage of the trajectories last the entire simulation. Also note that the number of trajectories per frame matches the number of observations much more closely for both the sample from the importance distribution and CSIR.

This example was repeated for N = 5, 20, 40 particles: N = 5 took 32 seconds, N = 20 took 126 seconds, and N = 40 took 253 seconds, so the increase is essentially linear in the number of particles. This is due to the fact that most grouping subsets are well separated from each other, so it is possible to use a single marginal importance weight computation for most particle/grouping subset pairs instead of the more complicated O(N²) version.

Figure 6.8: GemVident screenshot, showing centroids.

Figure 6.9: Centroid observations per frame from Harvester ant example.

Figure 6.10: Distribution of path lengths and trajectories per frame using a sample from the importance distribution, Harvester ant example.

Figure 6.11: Distribution of path lengths and trajectories per frame using CSIR, Harvester ant example.

Chapter 7

Network Growth Models

In recent years there has been much interest in the study of generative models that explain observed properties of networks derived from biology, sociology, and computer science. The next two chapters will discuss Monte Carlo methods for performing inference on network data structures. Among the most widely researched models are network growth models such as preferential attachment [8] and duplication/divergence (also known as vertex copying) [59, 25].
One reason for the focus on dynamic models of network growth is that they allow researchers to understand how networks develop over time, with relatively simple rules resulting in surprising complexity. For a survey of the history and applications of network growth models see Newman [78], and Durrett [32] for a more recent book-length treatment.

The primary motivations for studying generative models of network growth are to help explain, understand, and predict observed network data structures and features. For example, many real-world networks have degree sequences that appear to follow heavy-tailed distributions, such as the so-called power-law distributions, where the probability of a vertex having degree k is proportional to k^{−α} for α > 0. The classical models of random networks, such as the Erdős–Rényi G(n, p) model in which edges are placed between each pair of vertices according to independent Bernoulli trials with probability p, have asymptotically Poisson degree distributions, making them unsuitable for modeling networks with heavy-tailed degree distributions. In contrast, many network growth models such as preferential attachment and vertex copying produce asymptotically power-law networks.

7.1 Background

Many realistic models of network formation, whether of sociological, biological, or computer networks, involve growth from some smaller "seed" network. A network growth model is a stochastic process through which a network G is formed by adding vertices one by one, with edges created between the new vertex and pre-existing vertices. Well-known examples of network growth models include the so-called preferential attachment and vertex duplication models.

In the current context, network growth models are considered. Unless otherwise stated, graphs are assumed to be simple, i.e. they are undirected, have no self-loops, no multiple edges, and no edge weights.
Furthermore, the models used are assumed to be strict growth models, i.e. removal of vertices or edges is not allowed, edges cannot be "rewired" once they appear, and new edges always have one endpoint in the newly added vertex. The advantage of modeling networks through strict growth mechanisms, as opposed to a more general model that might allow edge and vertex deletions or rewirings, is that they are easier to analyze and simulate.

One issue with many commonly used network growth models is that they are often formulated in a way that is not statistically well-defined, with most real-world networks having zero measure. For example, the original Barabási–Albert preferential attachment model adds a deterministic number of edges k at each time-step; if a network does not have kn edges, then it has zero likelihood. Similar complaints can be made about many other common models, such as the vertex copying and forest fire models. The primary reason these models are not well-defined statistically is that they were formulated to make theoretical analysis of their aggregate properties easier, such as showing that they give power-law degree distributions or are well-connected. In the current context, however, it is preferable to use models that are statistically well-defined. There have been several attempts to make network growth models more statistically robust; see for example Leskovec et al. [65] and Sheridan et al. [90].

Given the order in which nodes arrived in an observed network, one can compute the likelihood of the network under a growth model by multiplying the likelihood contribution of each vertex as it enters the network. In other words, to compute the likelihood of a graph G given an ordering µ and model parameters θ, one simply forms the product

    W(G | θ, µ) = ∏_{i=1}^{n} P(x_{µ(i)} | x_{µ(1)}, ..., x_{µ(i−1)}),    (7.1)

where x_{µ(i)} is the event of the µ(i)th vertex entering G along with its corresponding edges.
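The product (7.1) can be made concrete by plugging in a specific growth model. The sketch below specializes, purely for illustration, to the uniform attachment model discussed in §7.1.1, where the i-th arriving vertex links to each earlier vertex independently with probability p; the function name `ua_order_likelihood` is hypothetical.

```python
def ua_order_likelihood(edges, mu, p):
    """Order likelihood (7.1) specialized to uniform attachment: the i-th
    arriving vertex links to each of its i-1 predecessors independently with
    probability p, so step i contributes p^k * (1-p)^((i-1)-k), where k is
    the number of edges from vertex mu[i] back to earlier vertices."""
    edge_set = {frozenset(e) for e in edges}
    likelihood = 1.0
    for i, v in enumerate(mu):
        k = sum(1 for u in mu[:i] if frozenset((u, v)) in edge_set)
        likelihood *= p ** k * (1 - p) ** (i - k)
    return likelihood
```

Under uniform attachment the result is the same for every ordering, since the product collapses to p^m (1 − p)^{C(n,2) − m}; this order-invariance makes a convenient sanity check. For models such as preferential attachment the per-step conditionals depend on the current degrees, and different orderings give different order likelihoods.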
Denote by W(G | θ, µ) the order likelihood of G.

This representation immediately yields a mechanism for computing the full likelihood of G by summing the order likelihoods (7.1) over Πn, the space of all possible orderings of length n:

    W(G | θ) = ∑_{µ∈Π_n} W(G | θ, µ) P(µ).    (7.2)

Here P(µ) is the prior distribution over the space of permutations Πn. In this chapter P(µ) is assumed to be uniform over the space of permutations, but in practice one may have prior knowledge about the order in which vertices entered the network and can construct a different prior distribution.

Since there are n! possible permutations of length n, brute-force enumeration is infeasible except for the smallest graphs. A natural idea is thus to construct an unbiased Monte Carlo estimator Ŵ(G|θ) of (7.2) by drawing N iid samples µ1, ..., µN distributed according to P(µ):

    Ŵ(G | θ) = (1/N) ∑_{i=1}^{N} W(G | θ, µ_i).    (7.3)

In general the estimator Ŵ may have high variance, making the Monte Carlo estimator little better than brute force. Fortunately, one can usually reduce the variance of Ŵ by using the importance sampling techniques described in §2.1.

7.1.1 Erdős–Rényi

Perhaps the simplest model of network growth is uniform attachment (UA), where edges are attached from a new vertex to pre-existing vertices independently at random with uniform constant probability p. This process is equivalent to the well-studied Erdős–Rényi random graph model G(n, p). The G(n, p) and closely related G(n, m) models were introduced by Gilbert [41] in 1959 and subsequently analyzed by Erdős and Rényi [34]. For comprehensive book-length treatments of G(n, p) and G(n, m) see Bollobás [16], Janson et al. [50], and also Durrett [32]. The advantage of G(n, p) lies in the simplicity of its analysis. Vertex degrees are distributed according to a Binomial Bin(n − 1, p) distribution.
Since edges are added independently according to a Bernoulli with parameter p, the number of edges m in G is a sufficient statistic. The likelihood of G given parameter p is simply

    W(G | p) = P( Bin( C(n, 2), p ) = m ),    (7.4)
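Equation (7.4) is straightforward to evaluate directly. The snippet below is a direct transcription, with m taken to be the observed edge count; the function name `gnp_likelihood` is hypothetical.

```python
from math import comb

def gnp_likelihood(n, m, p):
    """Likelihood (7.4): probability that a Binomial(C(n,2), p) draw equals
    the observed edge count m, m being a sufficient statistic for p."""
    trials = comb(n, 2)                     # C(n, 2) possible edges
    return comb(trials, m) * p ** m * (1 - p) ** (trials - m)
```

For example, a triangle on n = 3 vertices (m = 3 edges out of 3 possible) has likelihood p³, which equals 0.125 at p = 0.5.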