
Lecture 14: Bayesian networks II

Review: Bayesian network

[Graph: C → H, A → H, A → I]

P(C = c, A = a, H = h, I = i) ≜ p(c) p(a) p(h | c, a) p(i | a)

Definition: Bayesian network

Let X = (X1, ..., Xn) be random variables. A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over X as a product of local conditional distributions, one for each node:

P(X1 = x1, ..., Xn = xn) = ∏_{i=1}^n p(xi | xParents(i))

• Last time, we talked about Bayesian networks, which are a fun and convenient modeling framework. We posit a collection of variables that describe the state of the world, and then create a story about how the variables are generated (recall the probabilistic program interpretation).
• A Bayesian network specifies two parts: (i) a graph structure, which governs the qualitative relationship between the variables, and (ii) local conditional distributions, which specify the quantitative relationship.
• Formally, a Bayesian network defines a joint probability distribution over many variables (e.g., P(C, A, H, I)) via the local conditional distributions (e.g., p(i | a)). This joint distribution specifies all the information we know about how the world works.
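To make this concrete, here is a minimal Python sketch of the cold/allergies network above. The numerical values of the local conditional distributions are made up purely for illustration.

```python
# Hypothetical local conditional distributions for the network
# C (cold), A (allergies), H (cough), I (itchy eyes).
p_c = {1: 0.1, 0: 0.9}                    # p(c)
p_a = {1: 0.2, 0: 0.8}                    # p(a)
p_h1 = {(1, 1): 0.9, (1, 0): 0.8,         # p(h = 1 | c, a)
        (0, 1): 0.6, (0, 0): 0.1}
p_i1 = {1: 0.7, 0: 0.05}                  # p(i = 1 | a)

def joint(c, a, h, i):
    """P(C = c, A = a, H = h, I = i) = p(c) p(a) p(h | c, a) p(i | a)."""
    ph = p_h1[(c, a)] if h == 1 else 1 - p_h1[(c, a)]
    pi = p_i1[a] if i == 1 else 1 - p_i1[a]
    return p_c[c] * p_a[a] * ph * pi

# Sanity check: the joint distribution sums to 1 over all 16 assignments.
print(sum(joint(c, a, h, i) for c in (0, 1) for a in (0, 1)
          for h in (0, 1) for i in (0, 1)))  # ≈ 1.0
```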

Review: probabilistic inference

Input

Bayesian network: P(X1 = x1, ..., Xn = xn)
Evidence: E = e where E ⊆ X is a subset of variables
Query: Q ⊆ X is a subset of variables

Output

P(Q = q | E = e) for all values q

Example: if coughing but no itchy eyes, have a cold?

P(C | H = 1,I = 0)

• Think of the joint probability distribution defined by the Bayesian network as a guru. Probabilistic inference allows you to ask the guru anything: what is the probability of having a cold? What if I'm coughing? What if I don't have itchy eyes? In this lecture, we're going to build such a guru that can answer these queries efficiently.

Roadmap

Forward-backward

Gibbs sampling

Particle filtering

Object tracking

[Graph: H1 → H2 → H3 → H4, with an emission edge Hi → Ei for each i]

Problem: object tracking

Hi ∈ {1,...,K}: location of object at time step i

Ei ∈ {1,...,K}: sensor reading at time step i

Start: p(h1): uniform over all locations

Transition p(hi | hi−1): uniform over adjacent loc.

Emission p(ei | hi): uniform over adjacent loc.

• So far, we have computed various ad-hoc probabilistic queries on ad-hoc Bayesian networks by hand. We will now switch gears and focus on the popular hidden Markov model (HMM), and show how the forward-backward algorithm can be used to compute typical queries of interest.
• As motivation, consider the problem of tracking an object. The probabilistic story is as follows: an object starts at H1, drawn uniformly over all possible locations. At each time step thereafter, it transitions to an adjacent location with equal probability. For example, if H2 = 3, then H3 ∈ {2, 4} with equal probability. At each time step, we obtain a sensor reading Ei which is also uniform over locations adjacent to Hi.
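As a sanity check on this generative story, here is a small sketch that samples a trajectory and sensor readings from the model. It assumes K locations arranged in a line (so boundary locations have a single neighbor); the function names are our own.

```python
import random

K = 10  # number of locations 1, ..., K, arranged in a line (assumption)

def adjacent(h):
    """Locations adjacent to h, clipped at the boundary."""
    return [v for v in (h - 1, h + 1) if 1 <= v <= K]

def sample_hmm(n):
    """Sample hidden locations h_1..h_n and sensor readings e_1..e_n."""
    hs, es = [], []
    h = random.randint(1, K)                   # start: uniform over locations
    for _ in range(n):
        hs.append(h)
        es.append(random.choice(adjacent(h)))  # emission: uniform over adjacent
        h = random.choice(adjacent(h))         # transition: uniform over adjacent
    return hs, es

print(sample_hmm(5))
```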

Hidden Markov model

[Graph: H1 → H2 → H3 → H4 → H5, with an emission edge Hi → Ei for each i]

P(H = h, E = e) = p(h1) ∏_{i=2}^n p(hi | hi−1) ∏_{i=1}^n p(ei | hi)
(the three factors are the start, transition, and emission probabilities, respectively)

Query (filtering):

P(H3 | E1 = e1, E2 = e2, E3 = e3)

Query (smoothing):

P(H3 | E1 = e1,E2 = e2,E3 = e3,E4 = e4,E5 = e5)

• In principle, you could ask any type of query on an HMM, but there are two common ones: filtering and smoothing.
• Filtering asks for the distribution of some hidden variable Hi conditioned on only the evidence up until that point. This is useful when you're doing real-time object tracking and can't see the future.
• Smoothing asks for the distribution of some hidden variable Hi conditioned on all the evidence, including the future. This is useful when you have collected all the data and want to retroactively figure out what Hi was.

Lattice representation

H1 =1 H2 =1 H3 =1 H4 =1 H5 =1

start H1 =2 H2 =2 H3 =2 H4 =2 H5 =2 end

H1 =3 H2 =3 H3 =3 H4 =3 H5 =3

• Edge start ⇒ H1 = h1 has weight p(h1) p(e1 | h1)
• Edge Hi−1 = hi−1 ⇒ Hi = hi has weight p(hi | hi−1) p(ei | hi)
• Each path from start to end is an assignment with weight equal to the product of node/edge weights

• Now let's actually compute these queries. We will do smoothing first. Filtering is a special case: if we're asking for Hi given E1, ..., Ei, then we can marginalize out the future, reducing the problem to a smaller HMM.
• A useful way to think about inference is to return to state-based models. Consider a graph with a start node, an end node, and a node for each assignment of a value to a variable Hi = v. The nodes are arranged in a lattice, where each column corresponds to one variable Hi and each row corresponds to a particular value v. Each path from the start to the end corresponds exactly to a complete assignment to the variables.
• Note that in the reduction from a variable-based model to a state-based model, we have committed to an ordering of the variables.
• Each edge has a weight (a single number) determined by the local conditional probabilities (more generally, the factors in a factor graph). For each edge into Hi = hi, we multiply by the transition probability into hi and the emission probability p(ei | hi). This defines a weight for each path (assignment) in the graph equal to the joint probability P(H = h, E = e).
• Note that the lattice contains O(Kn) nodes and O(K²n) edges, where n is the number of variables and K is the number of values in the domain of each variable.

Lattice representation

H1 =1 H2 =1 H3 =1 H4 =1 H5 =1

start H1 =2 H2 =2 H3 =2 H4 =2 H5 =2 end

H1 =3 H2 =3 H3 =3 H4 =3 H5 =3

Forward: Fi(hi) = Σ_{hi−1} Fi−1(hi−1) w(hi−1, hi)

sum of weights of paths from start to Hi = hi

Backward: Bi(hi) = Σ_{hi+1} Bi+1(hi+1) w(hi, hi+1)

sum of weights of paths from Hi = hi to end

Define Si(hi) = Fi(hi)Bi(hi):

sum of weights of paths from start to end through Hi = hi

• The point of bringing back the search-based view is that we can cast the probability queries we care about in terms of sums over paths, and effectively use dynamic programming.
• First, define the forward message Fi(v) to be the sum of the weights over all paths from the start node to Hi = v. This can be defined recursively: any path that reaches Hi = hi must pass through some Hi−1 = hi−1, so we can sum over all possible values of hi−1.

• Analogously, let the backward message Bi(v) be the sum of the weights over all paths from Hi = v to the end node.
• Finally, define Si(v) to be the sum of the weights over all paths from the start node to the end node that pass through the intermediate node Hi = v. This quantity is just the product of the weight of the paths going into Hi = hi (namely Fi(hi)) and those leaving it (namely Bi(hi)).

Lattice representation

Smoothing queries (marginals):

P(Hi = hi | E = e) ∝ Si(hi)

Algorithm: forward-backward algorithm

Compute F1,F2,...,FL

Compute BL,BL−1,...,B1

Compute Si for each i and normalize

Running time: O(LK²)

• Let us go back to the smoothing queries: P(Hi = hi | E = e). This is obtained simply by normalizing Si.

• The algorithm is thus as follows: for each node Hi = hi, we compute three numbers: Fi(hi), Bi(hi), Si(hi). First, we sweep forward to compute all the Fi's recursively. At the same time, we sweep backward to compute all the Bi's recursively. Then we compute Si by pointwise multiplication.
• Implementation note: technically we can wait to normalize Si into P(Hi | E = e) at the very end, but it is useful to normalize Fi and Bi at each step to avoid underflow. In addition, normalization of the forward messages yields the filtering marginals: P(Hi = v | E1 = e1, ..., Ei = ei) ∝ Fi(v).

Summary

• Lattice representation: paths are assignments (think state-based models)

• Dynamic programming: compute sums efficiently

• Forward-backward algorithm: share intermediate computations across different queries
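To make the summary concrete, here is a minimal numpy sketch of the forward-backward algorithm, with normalization at each step to avoid underflow. The names (start, trans, emit) and the matrix encoding of the HMM are our own assumptions.

```python
import numpy as np

def forward_backward(start, trans, emit, obs):
    """Smoothing marginals P(H_i | E = e) for an HMM.

    start: (K,) start distribution p(h1)
    trans: (K, K) matrix with trans[u, v] = p(h_i = v | h_{i-1} = u)
    emit:  (K, M) matrix with emit[v, e] = p(e_i = e | h_i = v)
    obs:   list of L observed emissions e_1, ..., e_L
    """
    L, K = len(obs), len(start)
    F = np.zeros((L, K))
    B = np.zeros((L, K))
    # Forward: F_i(v) = sum of path weights from start to H_i = v.
    F[0] = start * emit[:, obs[0]]
    F[0] /= F[0].sum()  # normalized F[i] is the filtering marginal
    for i in range(1, L):
        F[i] = (F[i - 1] @ trans) * emit[:, obs[i]]
        F[i] /= F[i].sum()
    # Backward: B_i(v) = sum of path weights from H_i = v to end.
    B[L - 1] = 1.0
    for i in range(L - 2, -1, -1):
        B[i] = trans @ (B[i + 1] * emit[:, obs[i + 1]])
        B[i] /= B[i].sum()
    # Pointwise multiply and normalize: row i is P(H_i | E = e).
    S = F * B
    return S / S.sum(axis=1, keepdims=True)
```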

Roadmap

Forward-backward

Gibbs sampling

Particle filtering

Particle-based approximation

P(X1, X2, X3)

Key idea: particles

Use a small set of assignments (particles) to represent a large probability distribution.

x1 x2 x3   P(X1 = x1, X2 = x2, X3 = x3)
0  0  0    0.18
0  0  1    0.02
0  1  0    0
0  1  1    0
1  0  0    0
1  0  1    0
1  1  0    0.08
1  1  1    0.72

(true marginals of Xi = 1: 0.8, 0.8, 0.74)

Sample 1: [0, 0, 0]
Sample 2: [1, 1, 1]
Sample 3: [1, 1, 1]

Estimated marginals of Xi = 1: 0.67, 0.67, 0.67

• The central idea behind both Gibbs sampling and particle filtering is the use of particles (just a fancy word for a complete assignment, usually generated stochastically) to represent a probability distribution.
• Rather than storing the probability of every single assignment, we have a set of assignments, some of which can occur multiple times (which implicitly represents a higher probability). Maintaining this small set of assignments allows us to answer queries more quickly, but at a cost: our answers will now be approximate instead of exact.
• From a set of particles, we can compute approximate marginals (or any query we want) by simply computing the fraction of assignments that satisfy the desired condition. Here, marginalization is easy because we're explicitly enumerating full assignments.
• Once we have a set of particles, we can compute all the queries we want with it. So now how do we actually generate the particles?

Gibbs sampling

Algorithm: Gibbs sampling

Initialize x to a random complete assignment
Loop through i = 1, . . . , n until convergence:

Compute weight of x ∪ {Xi : v} for each v

Choose x ∪ {Xi : v} with probability prop. to weight

Gibbs sampling (probabilistic interpretation)

Loop through i = 1, . . . , n until convergence:
Set Xi = v with probability P(Xi = v | X−i = x−i)
(notation: X−i = X \ {Xi})

[demo]

• Recall that Gibbs sampling proceeds by going through each variable Xi, considering all the possible assignments of Xi with some v ∈ Domaini, computing the weight of the resulting assignment x ∪ {Xi : v}, and choosing an assignment with probability proportional to the weight.
• We first introduced Gibbs sampling in the context of local search to get out of local optima. Now, we will use it for its original purpose, which is to draw samples from a probability distribution. (In particular, our goal is to draw from the joint distribution over X1, X2, . . . , Xn.) To do this, we need to define a probability distribution given an arbitrary factor graph. Recall that a general factor graph defines a Weight(x), which is the product of all its factors. We can simply normalize the weight to get a distribution: P(X = x) ∝ Weight(x). Then, Gibbs sampling is exactly sampling according to the conditional distribution P(Xi = v | X−i = x−i).
• Note that the convergence criterion is on the distribution of the sampled values, not the values themselves.
• Advanced: Gibbs sampling is an instance of a Markov chain Monte Carlo (MCMC) algorithm, which generates a sequence of particles X(1), X(2), X(3), . . . . A Markov chain is irreducible if there is positive probability of getting from any assignment to any other assignment (now the probabilities are over the random choices of the sampler). When the Gibbs sampler is irreducible, then in the limit as t → ∞, the distribution of X(t) converges to the true distribution P(X). MCMC is a very rich topic which we will not talk about very much here.
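Here is a minimal sketch of one sweep of this procedure for a generic factor graph, assuming a caller-supplied weight function over complete assignments (the names gibbs_sweep, weight, and domains are our own):

```python
import random

def gibbs_sweep(x, weight, domains):
    """One Gibbs sampling sweep over all variables.

    x: dict {variable: value}, a complete assignment (modified in place)
    weight: function mapping a complete assignment to a nonnegative weight
    domains: dict {variable: list of candidate values}
    """
    for i in x:
        # Weight of x with Xi set to each candidate value v.
        weights = [weight({**x, i: v}) for v in domains[i]]
        # Choosing proportionally to these weights is exactly sampling
        # from the conditional P(Xi = v | X_{-i} = x_{-i}).
        x[i] = random.choices(domains[i], weights=weights)[0]
    return x
```

Repeatedly calling gibbs_sweep and recording the resulting assignments yields particles whose empirical distribution (after enough sweeps) approximates P(X = x) ∝ Weight(x).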

Application: image denoising

Example: image denoising

• Xi ∈ {0, 1} is the pixel value at location i
• A subset of pixels are observed:

oi(xi) = [xi = observed value at i]
• Neighboring pixels are more likely to be the same than different:

tij(xi, xj) = [xi = xj] + 1

• Factor graphs (sometimes referred to as Markov random fields) play an important role in computer vision applications. Here we take a look at a very simple image denoising application and construct a factor graph.
• We assume that we have observed some fraction of the pixels in an image, and we wish to recover the pixels which have been removed. Our simple factor graph has two types of factors.
• First, oi(xi) = 0 if the pixel at i is observed and xi does not match it. This is analogous to the emission probabilities in an HMM, but here, the factor is deterministic (0 or 1).
• Second, tij(xi, xj) exists for every pair of neighboring pixels, and encourages them to agree (both be 0 or both be 1). Weight 2 is given to pairs which are the same and weight 1 to pairs which differ. This is analogous to the transition probabilities in an HMM.

Application: image denoising

Example: image denoising

If neighbors are 1, 1, 1, 0 and Xi is not observed:
P(Xi = 1 | X−i = x−i) = (2·2·2·1) / (2·2·2·1 + 1·1·1·2) = 0.8

If neighbors are 0, 1, 0, 1 and Xi is not observed:
P(Xi = 1 | X−i = x−i) = (1·2·1·2) / (1·2·1·2 + 2·1·2·1) = 0.5

[whiteboard]

• Let us compute the Gibbs sampling update. We go through each pixel Xi and try to update its value. Specifically, we condition on X−i = x−i being the current value.
• To compute the conditional P(Xi = xi | X−i = x−i), we only need to multiply in the factors that touch Xi, since conditioning disconnects the graph completely. All the other factors can simply be ignored.
• For example, suppose we are updating variable Xi, which has neighbors Xj, Xk, Xl, Xm. Suppose we condition on Xj = 1, Xk = 1, Xl = 0, Xm = 1. Then P(Xi = xi | X−i = x−i) ∝ tij(xi, 1) tik(xi, 1) til(xi, 0) tim(xi, 1) oi(xi). We can compute the RHS for both xi = 1 and xi = 0, and then normalize to get the 0.8.
• Intuitively, the neighbors are all trying to pull Xi towards their values, and 0.8 reflects the fact that the pull towards 1 is stronger.
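A few lines of Python reproduce the 0.8 computation above (assuming Xi is unobserved, so the oi factor is constant and can be dropped; the helper names are ours):

```python
def t(xi, xj):
    """Coherence factor: 2 if neighboring pixels agree, 1 otherwise."""
    return (xi == xj) + 1

neighbors = [1, 1, 1, 0]
# Unnormalized weight of each candidate value for Xi.
w = {xi: 1.0 for xi in (0, 1)}
for xj in neighbors:
    for xi in w:
        w[xi] *= t(xi, xj)
print(w[1] / (w[0] + w[1]))  # 0.8
```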

Gibbs sampling: demo

[see web version]

• Try playing with the demo by modifying the settings to get a feeling for what Gibbs sampling is doing. Each iteration corresponds to resampling each pixel (variable).
• When you hit ctrl-enter for the first time, red and black correspond to 1 and 0, and white corresponds to unobserved.
• showMarginals allows you to either view the particles produced or the marginals estimated from the particles (this gives you a smoother probability estimate of what the pixel values are).
• If you increase missingFrac, the problem becomes harder.
• If you set coherenceFactor to 1, this is equivalent to turning off the edge factors.
• If you set icm to true, we will use local search rather than Gibbs sampling, which produces very bad solutions.
• In summary, Gibbs sampling is an algorithm that can be applied to any factor graph. It performs a guided random walk in the space of all possible assignments. Under some mild conditions, Gibbs sampling is guaranteed to converge to a sample from the true distribution P(X = x) ∝ Weight(x), but this might take until the heat death of the universe for complex models. Nonetheless, Gibbs sampling is widely used due to its simplicity and can work reasonably well.

Roadmap

Forward-backward

Gibbs sampling

Particle filtering

Hidden Markov models

[Graph: X1 → X2 → X3 → X4 → X5, with an emission edge Xi → Ei for each i]

P(X = x, E = e) = p(x1) ∏_{i=2}^n p(xi | xi−1) ∏_{i=1}^n p(ei | xi)
(the three factors are the start, transition, and emission probabilities, respectively)

Query (filtering):

P(X1 | E1 = e1)

P(X2 | E1 = e1,E2 = e2)

P(X3 | E1 = e1,E2 = e2,E3 = e3)

• Now we turn our attention to particle filtering. Gibbs sampling is the probabilistic analog of local search methods such as ICM, and particle filtering is the probabilistic analog of partial search methods such as beam search.
• Although particle filtering applies to general factor graphs, we will present it for hidden Markov models for concreteness.
• As the name suggests, we will use particle filtering for answering filtering queries.

Review: beam search

Idea: keep a list C of ≤ K candidate partial assignments

Algorithm: beam search

Initialize C ← [{}]
For each i = 1, . . . , n:
Extend:
C′ ← {x ∪ {Xi : v} : x ∈ C, v ∈ Domaini}
Prune:
C ← K elements of C′ with highest weights

[demo: beamSearch({K:3})]

Review: beam search

Beam size K = 4

• Recall that beam search effectively does a pruned BFS of the search tree of partial assignments, where at each level, we keep track of the K partial assignments with the highest weight.
• There are two phases. In the first phase, we extend all the existing candidates C to all possible assignments to Xi; this results in K·|Domaini| candidates C′. These C′ are sorted by weight and pruned by taking the top K.

Beam search

End result:

• Candidate list C is set of particles

• Use C to compute marginals

Problems:
• Extend: slow because it requires considering every possible value for Xi
• Prune: greedily taking the best K doesn't provide diversity

Solution (3 steps): propose, weight, resample

• Beam search does generate a set of particles, but there are two problems.

• First, it can be slow if Domaini is large, because we have to try every single value. Perhaps we can be smarter about which values to try.
• Second, we are greedily taking the top K candidates, which can be too myopic. Can we somehow encourage more diversity?
• Particle filtering addresses both of these problems. There are three steps: propose, which extends the current partial assignment, and weight/resample, which redistribute resources over the particles based on evidence.

Step 1: propose

Old particles: ≈ P(X1, X2 | E1 = 0, E2 = 1)
[0, 1]
[1, 0]

Key idea: proposal distribution

For each old particle (x1, x2), sample X3 ∼ p(x3 | x2).

New particles: ≈ P(X1, X2, X3 | E1 = 0, E2 = 1)
[0, 1, 1]
[1, 0, 0]

• Suppose we have a set of particles that approximates the filtering distribution over X1, X2. The first step is to extend each current partial assignment (particle) from x1:i−1 = (x1, . . . , xi−1) to x1:i = (x1, . . . , xi).
• To do this, we simply go through each particle and extend it stochastically, using the transition probability p(xi | xi−1) to sample a new value of Xi.

• (For concreteness, think of what happens if p(xi | xi−1) = 0.8 when xi = xi−1 and 0.2 otherwise.)
• We can think of advancing each particle according to the dynamics of the HMM. These particles approximate the probability of X1, X2, X3, but still conditioned on the same evidence.

Step 2: weight

Old particles: ≈ P(X1, X2, X3 | E1 = 0, E2 = 1)
[0, 1, 1]
[1, 0, 0]

Key idea: weighting

For each old particle (x1, x2, x3), weight it by w(x1, x2, x3) = p(e3 | x3).

New particles:

≈ P(X1, X2, X3 | E1 = 0, E2 = 1, E3 = 1)
[0, 1, 1] (0.8)
[1, 0, 0] (0.4)

• Having generated a set of K candidates, we now need to take into account the new evidence Ei = ei. This is a deterministic step that simply weights each particle by the probability of generating Ei = ei, which is the emission probability p(ei | xi).
• Intuitively, the proposal was just a guess about where the object (X3) will be. To get a more realistic picture, we condition on E3 = 1. Supposing that p(ei = 1 | xi = 1) = 0.8 and p(ei = 1 | xi = 0) = 0.4, we then get the weights 0.8 and 0.4. Note that these weights do not sum to 1.

Step 3: resample

Question: given weighted particles, which do we choose?

Tricky situation:
• Target distribution close to uniform
• Fewer particles than locations

• Having proposed extensions to the particles and computed a weight for each particle, we now come to the question of which particles to keep.
• Intuitively, if a particle has very small weight, then we might want to prune it away. On the other hand, if a particle has high weight, maybe we should dedicate more resources to it.
• As a motivating example, consider an almost uniform distribution over a set of locations, and trying to represent this distribution with fewer particles than locations. This is a tough situation to be in.

Step 3: resample

[Two panels: the K locations with the highest weight vs. K locations sampled from the weight distribution]

Intuition: top K assignments not representative.

Maybe random samples will be more representative...

• Beam search, which would choose the K locations with the highest weight, would clump all the particles near the mode. This is risky, because we have no support farther from the center, where there is actually substantial probability.
• However, if we sample from the distribution which is proportional to the weights, then we can hedge our bets and get a more representative set of particles which cover the space more evenly.

Step 3: resample

Key idea: resampling

Given a distribution P(A = a) with n possible values, draw a sample K times.

Intuition: redistribute particles to more promising areas

Example: resampling

Distribution:
a    P(A = a)
a1   0.70
a2   0.20
a3   0.05
a4   0.05

Drawn samples (K = 4):
sample 1: a1
sample 2: a2
sample 3: a1
sample 4: a1

• After proposing and weighting, we end up with a set of samples x1:i, each with some weight w(x1:i). Intuitively, if w(x1:i) is really small, then it might not be worth keeping that particle around.
• Resampling allows us to place (possibly multiple) particles on high-weight assignments. In the example above, we happen not to sample a3 and a4 because they have low probability of being sampled.
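In code, resampling is a one-liner with numpy; a minimal sketch using the example distribution above:

```python
import numpy as np

values = ["a1", "a2", "a3", "a4"]
probs = [0.70, 0.20, 0.05, 0.05]
# Draw K = 4 independent samples from the distribution over particles.
print(np.random.choice(values, size=4, p=probs))  # e.g., ['a1' 'a2' 'a1' 'a1']
```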

Step 3: resample

Old particles:

≈ P(X1, X2, X3 | E1 = 0, E2 = 1, E3 = 1)
[0, 1, 1] (0.8) ⇒ 2/3
[1, 0, 0] (0.4) ⇒ 1/3

New particles:

≈ P(X1, X2, X3 | E1 = 0, E2 = 1, E3 = 1)
[0, 1, 1]
[0, 1, 1]

• In our example, we normalize the particle weights to form a distribution over particles. Then we draw K = 2 independent samples from that distribution. Higher weight particles will get more samples.

Particle filtering

Algorithm: particle filtering

Initialize C ← [{}]
For each i = 1, . . . , n:
Propose (extend):
C′ ← {(x1:i−1, xi) : x1:i−1 ∈ C, xi ∼ p(xi | xi−1)}
Reweight:
compute weights w(x1:i) = p(ei | xi) for x1:i ∈ C′
Resample (prune):
C ← K elements drawn independently from ∝ w(x1:i)

[demo: particleFiltering({K:100})]

• The final algorithm here is very similar to beam search. We go through all the variables X1, . . . , Xn.

• For each candidate x1:i−1 ∈ C, we propose xi according to the transition distribution p(xi | xi−1).

• We then weight this particle using w(x1:i) = p(ei | xi).

• Finally, we select K particles by sampling K times independently from the distribution proportional to w(x1:i).
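Putting the three steps together, here is a minimal numpy sketch of particle filtering for an HMM. As before, the names (start, trans, emit) and the matrix encoding are our own assumptions; each particle is stored as just its latest state, as discussed in the implementation notes below.

```python
import numpy as np

def particle_filter(start, trans, emit, obs, K=100, seed=0):
    """Approximate filtering for an HMM with K particles.

    start: (S,) start distribution p(x1)
    trans: (S, S) matrix with trans[u, v] = p(x_i = v | x_{i-1} = u)
    emit:  (S, M) matrix with emit[v, e] = p(e_i = e | x_i = v)
    obs:   observed emissions e_1, ..., e_n
    Returns K particle states, approximately drawn from P(X_n | E = e).
    """
    rng = np.random.default_rng(seed)
    S = len(start)
    particles = rng.choice(S, size=K, p=start)
    for i, e in enumerate(obs):
        if i > 0:
            # Propose: advance each particle via the transition distribution.
            particles = np.array([rng.choice(S, p=trans[x]) for x in particles])
        # Weight: emission probability of the observed evidence.
        w = emit[particles, e]
        # Resample: draw K particles independently, proportional to weights.
        particles = rng.choice(particles, size=K, p=w / w.sum())
    return particles
```

From the returned particles, np.bincount(particles, minlength=S) / K gives the estimated filtering marginals.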

Particle filtering: implementation

• If we only care about the last Xi, collapse all particles with the same Xi (think elimination):
001 ⇒ 1
101 ⇒ 1
010 ⇒ 0
010 ⇒ 0
110 ⇒ 0
• If many particles are the same, we can just store counts:
1, 1, 0, 0, 0 ⇒ {1: 2, 0: 3}

• In particle filtering as it is currently defined, each particle is an entire trajectory in the context of object tracking (assignment to all the variables).
• Often in tracking applications, we only care about the last location Xi, and the HMM is such that the future (Xi+1, . . . , Xn) is conditionally independent of X1, . . . , Xi−1 given Xi. Therefore, we often just store the value of Xi rather than its entire ancestry.
• When we only keep track of the Xi, we might have many particles that have the same value, so it can be useful to store just the counts of each value rather than having duplicates.

Application: tracking

[Factor graph: X1 — X2 — X3 — X4 — X5 with transition factors t1, . . . , t4 between consecutive variables and an observation factor oi attached to each Xi]

Example: tracking

• Xi: position of object at i

• Transitions: ti(xi, xi+1) = [xi near xi+1]

• Observations: oi(xi) = sensor reading...

Particle filtering demo

[see web version]

• Consider a tracking application where an object is moving around in a grid and we are trying to figure out its location Xi ∈ {1, . . . , grid-width} × {1, . . . , grid-height}.
• The transition factors say that from one time step to the next, the object is equally likely to have moved north, south, east, west, or stayed put.
• Each observation is a location on the grid (a yellow dot). The observation factor is a user-defined function which depends on the vertical and horizontal distance.
• Play around with the demo to get a sense of how particle filtering works, especially the different observation factors.

Probabilistic inference

Model (Bayesian network or factor graph):

P(X = x) = ∏_{i=1}^n p(xi | xParents(i))

Probabilistic inference:

P(Q | E = e)

Algorithms:
• Forward-backward: chain-structured (HMMs), exact
• Gibbs sampling, particle filtering: general, approximate

Next time: learning
