Lecture 14: Bayesian Networks II Review: Bayesian Network

Lecture 14: Bayesian networks II Review: Bayesian network C A P(C = c; A = a; H = h; I = i) def= p(c)p(a)p(h j c; a)p(i j a) H I Definition: Bayesian network Let X = (X1;:::;Xn) be random variables. A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over X as a product of local conditional distributions, one for each node: n Y P(X1 = x1;:::;Xn = xn) = p(xi j xParents(i)) i=1 CS221 / Spring 2019 / Charikar & Sadigh CS221 / Spring 2019 / Charikar & Sadigh 1 • Last time, we talked about Bayesian networks, which was a fun and convenient modeling framework. We posit a collection of variables that describe the state of the world, and then create a story on how the variables are generated (recall the probabilistic program interpretation). Review: probabilistic inference • A Bayesian network specifies two parts: (i) a graph structure which governs the qualitative relationship between the variables, and (ii) local conditional distributions, which specify the quantitative relationship. • Formally, a Bayesian network defines a joint probability distribution over many variables (e.g., Input P(C; A; H; I)) via the local conditional distributions (e.g., p(i j a)). This joint distribution specifies all the information we know about how the world works. Bayesian network: P(X1 = x1;:::;Xn = xn) Evidence: E = e where E ⊆ X is subset of variables Query: Q ⊆ X is subset of variables Output P(Q = q j E = e) for all values q Example: if coughing but no itchy eyes, have a cold? P(C j H = 1;I = 0) CS221 / Spring 2019 / Charikar & Sadigh 3 • Think of the joint probability distribution defined by the Bayesian network as a guru. Probabilistic inference allows you to ask the guru anything: what is the probability of having a cold? What if I'm coughing? What if I don't have itchy eyes? In this lecture, we're going to build such a guru that can answer these Roadmap queries efficiently. Forward-backward Gibbs sampling Particle filtering CS221 / Spring 2019 / Charikar & Sadigh 5 • So far, we have computed various ad-hoc probabilistic queries on ad-hoc Bayesian networks by hand. We will now switch gears and focus on the popular hidden Markov model (HMM), and show how the Object tracking forward-backward algorithm can be used to compute typical queries of interest. • As motivation, consider the problem of tracking an object. The probabilistic story is as follows: An object starts at H1 uniformly drawn over all possible locations. Then at each time step thereafter, it transitions H1 H2 H3 H4 to an adjacent location with equal probability. For example, if H2 = 3, then H3 2 f2; 4g with equal probability. At each time step, we obtain a sensor reading Ei which is also uniform over locations adjacent to Hi. E1 E2 E3 E4 Problem: object tracking Hi 2 f1;:::;Kg: location of object at time step i Ei 2 f1;:::;Kg: sensor reading at time step i Start: p(h1): uniform over all locations Transition p(hi j hi−1): uniform over adjacent loc. Emission p(ei j hi): uniform over adjacent loc. CS221 / Spring 2019 / Charikar & Sadigh 6 • In principle, you could ask any type of query on an HMM, but there are two common ones: filtering and Hidden Markov model smoothing. • Filtering asks for the distribution of some hidden variable Hi conditioned on only the evidence up until that point. This is useful when you're doing real-time object tracking, and you can't see the future. • Smoothing asks for the distribution of some hidden variable Hi conditioned on all the evidence, including the future. This is useful when you have collected all the data and want to retroactively go and figure out H1 H2 H3 H4 H5 what Hi was. E1 E2 E3 E4 E5 n n Y Y P(H = h; E = e) = p(h1) p(hi j hi−1) p(ei j hi) | {z } i=2 | {z } i=1 | {z } start transition emission Query (filtering): P(H3 j E1 = e1;E2 = e2;E3 = e3) Query (smoothing): P(H3 j E1 = e1;E2 = e2;E3 = e3;E4 = e4;E5 = e5) CS221 / Spring 2019 / Charikar & Sadigh 8 • Now let's actually compute these queries. We will do smoothing first. Filtering is a special case: if we're asking for Hi given E1;:::;Ei, then we can marginalize out the future, reducing the problem to a smaller Lattice representation HMM. • A useful way to think about inference is returning to state-based models. Consider a graph with a start node, an end node, and a node for each assignment of a value to a variable Hi = v. The nodes are H1 =1 H2 =1 H3 =1 H4 =1 H5 =1 arranged in a lattice, where each column corresponds to one variable Hi and each row corresponds to a particular value v. Each path from the start to the end corresponds exactly to a complete assignment to the nodes. start H =2 H =2 H =2 H =2 H =2 end • Note that in the reduction from a variable-based model to a state-based model, we have committed to an 1 2 3 4 5 ordering of the variables. • Each edge has a weight (a single number) determined by the local conditional probabilities (more generally, the factors in a factor graph). For each edge into Hi = hi , we multiply by the transition probability into H1 =3 H2 =3 H3 =3 H4 =3 H5 =3 hi and emission probability p(ei j hi). This defines a weight for each path (assignment) in the graph equal to the joint probability P (H = h; E = e). 2 • Edge start ) H1 = h1 has weight p(h1)p(e1 j h1) • Note that the lattice contains O(Kn) nodes and O(K n) edges, where n is the number of variables and K is the number of values in the domain of each variable. • Edge Hi−1 = hi−1 ) Hi = hi has weight p(hi j hi−1)p(ei j hi) • Each path from start to end is an assignment with weight equal to the product of node/edge weights CS221 / Spring 2019 / Charikar & Sadigh 10 • The point of bringing back the search-based view is that we can cast the probability queries we care about Lattice representation in terms of sums over paths, and effectively use dynamic programming. • First, define the forward message Fi(v) to be the sum of the weights over all paths from the start node to Hi = v . This can be defined recursively: any path that goes Hi = hi will have to go through some H1 =1 H2 =1 H3 =1 H4 =1 H5 =1 Hi−1 = hi−1 , so we can so over all possible values of hi−1. • Analogously, let the backward message Bi(v) be the sum of the weights over all paths from Hi = v to the end node. start H1 =2 H2 =2 H3 =2 H4 =2 H5 =2 end • Finally, define Si(v) to be the sum of the weights over all paths from the start node to the end node that pass through the intermediate node Xi = v . This quantity is just the product of the weights of paths H1 =3 H2 =3 H3 =3 H4 =3 H5 =3 going into Hi = hi (Fi(hi)) and those leaving it (Bi(hi)). Forward: F (h ) = P F (h )w(h ; h ) i i hi−1 i−1 i−1 i−1 i sum of weights of paths from start to Hi = hi Backward: B (h ) = P B (h )w(h ; h ) i i hi+1 i+1 i+1 i i+1 sum of weights of paths from Hi = hi to end Define Si(hi) = Fi(hi)Bi(hi): sum of weights of paths from start to end through Hi = hi CS221 / Spring 2019 / Charikar & Sadigh 12 • Let us go back to the smoothing queries: P(Hi = hi j E = e). This is just gotten by normalizing Si. Lattice representation • The algorithm is thus as follows: for each node Hi = hi , we compute three numbers: Fi(hi);Bi(hi);Si(hi). First, we sweep forward to compute all the Fi's recursively. At the same time, we sweep backward to compute all the Bi's recursively. Then we compute Si by pointwise multiplication. Smoothing queries (marginals): • Implementation note: we technically can normalize Si to get P(Hi j E = e) at the very end but it's useful to normalize Fi and Bi at each step to avoid underflow. In addition, normalization of the forward messages yields P(Hi = v j E1 = e1;:::;Ei = ei) / Fi(v). P(Hi = hi j E = e) / Si(hi) Algorithm: forward-backward algorithm Compute F1;F2;:::;FL Compute BL;BL−1;:::;B1 Compute Si for each i and normalize Running time: O(LK2) CS221 / Spring 2019 / Charikar & Sadigh 14 Summary Roadmap • Lattice representation: paths are assignments (think state-based models) Forward-backward • Dynamic programming: compute sums efficiently Gibbs sampling • Forward-backward algorithm: share intermediate computations across different queries Particle filtering CS221 / Spring 2019 / Charikar & Sadigh 16 CS221 / Spring 2019 / Charikar & Sadigh 17 • The central idea to both of Gibbs sampling and particle filtering is the use of particles (just a fancy word Particle-based approximation for complete assignment, usually generated stochastically) to represent a probability distribution. • Rather than storing the probability of every single assignment, we have a set of assignments, some of which can occur multiple times (which implicitly represents a higher probability). Maintaining this small set of assignments will allows us to answer queries more quickly, but at a cost: our answers will now be P(X1;X2;X3) approximate instead of exact.

Load more