Graphical Models
Graphical Models
Unit 11, Machine Learning, University of Vienna

Topics:
- Bayesian networks (directed graphs)
- The variable elimination algorithm
- Approximate inference (the Gibbs sampler)
- Markov networks (undirected graphs)
- Markov random fields (MRFs)
- Hidden Markov models (HMMs)
- The Viterbi algorithm
- The Kalman filter

Simple graphical model 1

A graph is a set of nodes together with the links between them, which can be either directed or undirected. If two nodes are not linked, then the corresponding variables are independent. The arrows denote causal relationships between the nodes, which represent features. The probability of A and B is the probability of A times the probability of B conditioned on A:

  P(a, b) = P(b|a) P(a)

Simple graphical model 2

The nodes are separated into:
- observed nodes, whose values we can see directly, and
- hidden (latent) nodes, whose values we hope to infer, and which may not have clear meanings in all cases.

C is conditionally independent of B, given A.

Example: Exam Panic

Directed acyclic graphs (DAGs) paired with conditional probability tables are called Bayesian networks.

B - whether the exam was boring
R - whether or not you revised
A - whether or not you attended lectures
P - whether or not you will panic before the exam

Example: Exam Panic

  P(b) = 0.5, P(¬b) = 0.5

  B | P(r)  P(¬r)        B | P(a)  P(¬a)
  T | 0.3   0.7          T | 0.1   0.9
  F | 0.8   0.2          F | 0.5   0.5

  R A | P(p)  P(¬p)
  T T | 0     1
  T F | 0.8   0.2
  F T | 0.6   0.4
  F F | 1     0

The probability of panicking:

  P(p) = Σ_{b,r,a} P(b, r, a, p) = Σ_{b,r,a} P(b) × P(r|b) × P(a|b) × P(p|r, a)

Example: Exam Panic

Suppose you know that the course was boring, and want to work out how likely it is that you will panic before the exam:

  P(p|b) = 0.3×0.1×0 + 0.7×0.1×0.6 + 0.3×0.9×0.8 + 0.7×0.9×1 = 0.888

Suppose instead you know that the course was not boring, and want to work out how likely it is that you will panic before the exam.
  P(p|¬b) = 0.8×0.5×0 + 0.8×0.5×0.8 + 0.2×0.5×0.6 + 0.2×0.5×1 = 0.48

  P(p) = P(p|b)P(b) + P(p|¬b)P(¬b) = 0.5×0.888 + 0.5×0.48 = 0.684

Backward inference or diagnosis

Suppose you panic outside the exam. Why are you panicking - is it because you didn't come to the lectures, or because you didn't revise? By Bayes' rule:

  P(r|p) = P(p|r)P(r) / P(p) = Σ_{b,a} P(b, a, r, p) / P(p)
         = [0.5×0.3×(0.1×0 + 0.9×0.8) + 0.5×0.8×(0.5×0 + 0.5×0.8)] / P(p)
         = 0.268 / 0.684 = 0.3918

Bayes' rule is the reason why this type of graphical model is known as a Bayesian network.

Computational costs

For a graph with N nodes where each node can be either true or false, the computational cost of exact inference is O(2^N); exact inference on Bayesian networks is NP-hard. For polytrees, where there is at most one path between any two nodes, the computational cost is linear in the size of the network. Unfortunately, polytrees are rare in real examples, so we will also consider approximate inference.

Variable Elimination Algorithm

The variable elimination algorithm speeds exact inference up a little by minimising the number of loops in the program.
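Before optimising anything, the exact-inference numbers above can be checked by brute-force enumeration of the joint distribution. The sketch below hard-codes the exam network's tables; the variable and function names are illustrative, not part of the original slides:

```python
# Brute-force exact inference for the exam network (names are illustrative).
from itertools import product

P_b = 0.5                                  # P(b)
P_r_given_b = {True: 0.3, False: 0.8}      # P(r | b)
P_a_given_b = {True: 0.1, False: 0.5}      # P(a | b)
P_p_given_ra = {(True, True): 0.0, (True, False): 0.8,
                (False, True): 0.6, (False, False): 1.0}  # P(p | r, a)

def prob(x, p):
    """Probability that a binary variable takes value x when P(True) = p."""
    return p if x else 1.0 - p

def joint(b, r, a, p):
    """P(b, r, a, p) factorised along the graph: P(b) P(r|b) P(a|b) P(p|r,a)."""
    return (prob(b, P_b) * prob(r, P_r_given_b[b]) *
            prob(a, P_a_given_b[b]) * prob(p, P_p_given_ra[(r, a)]))

# Marginal P(p): sum the joint over all assignments of the other variables.
P_p = sum(joint(b, r, a, True) for b, r, a in product([True, False], repeat=3))

# Backward inference P(r | p) via Bayes' rule: P(r, p) / P(p).
P_r_and_p = sum(joint(b, True, a, True) for b, a in product([True, False], repeat=2))
P_r_given_p = P_r_and_p / P_p

print(round(P_p, 3))          # 0.684, matching the hand calculation
print(round(P_r_given_p, 4))  # 0.3918
```

The three nested sums over b, r, a are exactly the O(2^N) cost that variable elimination tries to reduce.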
The conditional probability tables are converted into λ tables, which simply list all possible assignments of the variables and which initially contain the conditional probabilities:

  R A P | λ
  T T T | 0
  T T F | 1
  T F T | 0.8
  T F F | 0.2
  F T T | 0.6
  F T F | 0.4
  F F T | 1
  F F F | 0

Variable Elimination Algorithm

To eliminate R from the graph we do the following calculation, combining the (B, R) table with the (R, A) table by summing over R:

  B R | λ        R A | λ
  T T | 0.3      T T | 0
  T F | 0.7      T F | 0.8
  F T | 0.8      F T | 0.6
  F F | 0.2      F F | 1

  ⇒  B A | λ
     T T | 0.3×0   + 0.7×0.6 = 0.42
     T F | 0.3×0.8 + 0.7×1   = 0.94
     F T | 0.8×0   + 0.2×0.6 = 0.12
     F F | 0.8×0.8 + 0.2×1   = 0.84

Variable Elimination Algorithm I

create the λ tables:
  for each variable v:
    make a new table
    for all possible truth assignments x of the parent variables:
      add rows for P(v|x) and 1 − P(v|x) to the table
    add this table to the set of tables

eliminate known variable v:
  for each table:
    remove rows where v is incorrect
    remove the column for v from the table

Variable Elimination Algorithm II

eliminate other variables (where x is the variable to keep):
  for each variable v to be eliminated:
    create a new table t′
    for each table t containing v:
      v_true,t  = v_true,t  × P(v|x)
      v_false,t = v_false,t × P(¬v|x)
    v_true,t′  = Σ_t v_true,t
    v_false,t′ = Σ_t v_false,t
    replace the tables t with the new table t′

calculate the conditional probability:
  for each table:
    x_true  = x_true  × P(x)
    x_false = x_false × P(¬x)
  the probability is x_true / (x_true + x_false)

Markov Chain Monte Carlo (MCMC) methods

sample from the hidden variables:
- start at the top of the graph
- sample from each of the known probability distributions
- weight the samples by their likelihoods

In our example:
- generate a sample from P(b)
- use that value in the conditional probability tables for R and A to compute P(r|b = sampled value) and P(a|b = sampled value)
- use these three values to sample from P(p|b, a, r)
- take as many samples as you like in this way

Gibbs sampling

In MCMC we have to work
through the graph from top to bottom, selecting rows from the conditional probability tables that match the previous choices. It is better to sample from the unconditional distribution and reject any samples that do not have the correct prior probability (rejection sampling). We can work out what evidence we already have and use it to assign likelihoods to the other variables that are sampled:
- set values for all of the variables, based on either evidence or random choices
- find the probability distribution with Gibbs sampling

Gibbs sampling

The probabilities in the network factorise as

  p(x) = Π_j p(x_j | x_{α_j}),

where x_{α_j} are the parent nodes of x_j. In a Bayesian network, any given variable is conditionally independent of any node that is not its child, given its parents:

  p(x_j | x_{−j}) ∝ p(x_j | x_{α_j}) Π_{k∈β(j)} p(x_k | x_{α(k)}),

where β(j) is the set of children of node x_j, and x_{−j} denotes all of the values x_i except x_j. For any node we therefore only need to consider its parents, its children, and the other parents of its children. This set is known as the Markov blanket of the node.

The Gibbs Sampler

for each variable x_j:
  initialise x_j^(0)
repeat:
  sample x_1^(i+1) from p(x_1 | x_2^(i), ..., x_n^(i))
  sample x_2^(i+1) from p(x_2 | x_1^(i+1), x_3^(i), ..., x_n^(i))
  ...
  sample x_n^(i+1) from p(x_n | x_1^(i+1), ..., x_{n−1}^(i+1))
until you have enough samples

Markov Random Fields (MRFs): image denoising

Markov property: the state of a particular node is a function only of the states of its immediate neighbours. A binary image I with pixel values I_{x_i,x_j} ∈ {−1, 1} has noise in it, and we want to recover an "ideal" image I′ that has no noise. If the noise is small, then there should be a good correlation between I_{x_i,x_j} and I′_{x_i,x_j}. We also assume that within a small patch or region of the image there is a good correlation between pixels: I_{x_i,x_j} should correlate well with I_{x_i+1,x_j}, I_{x_i,x_j−1}, etc.
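The Gibbs sampler loop described above can be sketched for the exam network, conditioning on the observed evidence p = true and resampling each hidden variable from its Markov-blanket conditional. The CPT dictionaries and function names below are illustrative assumptions, not part of the original slides:

```python
import random

# CPTs of the exam network (names are illustrative).
P_b = 0.5
P_r_given_b = {True: 0.3, False: 0.8}
P_a_given_b = {True: 0.1, False: 0.5}
P_p_given_ra = {(True, True): 0.0, (True, False): 0.8,
                (False, True): 0.6, (False, False): 1.0}

def pr(x, p):
    """Probability that a binary variable takes value x when P(True) = p."""
    return p if x else 1.0 - p

def gibbs_estimate_r_given_p(n_samples=200_000, burn_in=1000, seed=0):
    """Estimate P(r | p=True) by Gibbs sampling over the hidden variables b, r, a.
    Each full conditional only involves the variable's Markov blanket:
    its parents, its children, and the other parents of its children."""
    rng = random.Random(seed)
    b, r, a = True, True, False   # arbitrary start consistent with p = True
    count_r = 0
    for i in range(n_samples + burn_in):
        # p(b | r, a) ∝ P(b) P(r|b) P(a|b)
        w_t = P_b * pr(r, P_r_given_b[True]) * pr(a, P_a_given_b[True])
        w_f = (1 - P_b) * pr(r, P_r_given_b[False]) * pr(a, P_a_given_b[False])
        b = rng.random() < w_t / (w_t + w_f)
        # p(r | b, a, p=True) ∝ P(r|b) P(p|r, a)
        w_t = P_r_given_b[b] * P_p_given_ra[(True, a)]
        w_f = (1 - P_r_given_b[b]) * P_p_given_ra[(False, a)]
        r = rng.random() < w_t / (w_t + w_f)
        # p(a | b, r, p=True) ∝ P(a|b) P(p|r, a)
        w_t = P_a_given_b[b] * P_p_given_ra[(r, True)]
        w_f = (1 - P_a_given_b[b]) * P_p_given_ra[(r, False)]
        a = rng.random() < w_t / (w_t + w_f)
        if i >= burn_in:
            count_r += r
    return count_r / n_samples

print(gibbs_estimate_r_given_p())  # ≈ 0.39 (exact answer: 0.268/0.684 ≈ 0.3918)
```

With enough samples the estimate approaches the exact backward-inference result computed earlier, without ever summing over the full joint distribution.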
Ising model

The original theory of MRFs was worked out by physicists in the form of the Ising model: a statistical description of a set of atoms connected in a chain, where each can spin up (+1) or down (−1), and whose spin affects those connected to it in the chain. Physicists tend to think in terms of the energy of such systems: stable states are those with the lowest energy, since the system needs extra energy to move out of such a state.

Markov Random Fields (MRFs): image denoising

The energy of our pair of images must be low when the pixels match. The energy of the same pixel in the two images is −η I_{x_i,x_j} I′_{x_i,x_j}, where η is a positive constant, and the energy of two neighbouring pixels is −ζ I_{x_i,x_j} I_{x_i+1,x_j}. The total energy is

  E(I, I′) = −ζ Σ_{i,j} I_{x_i,x_j} I_{x_i±1,x_j±1} − η Σ_{i,j} I_{x_i,x_j} I′_{x_i,x_j},

where the pixel indices are assumed to run from 1 to N in both the x and y directions.

The Markov Random Field Image Denoising Algorithm

given the noisy image I and parameters η, ζ:
  initialise the reconstruction I′ = I
  loop over the pixels of image I′:
    - compute the energies with the current pixel being −1 and 1
    - pick the value with the lower energy and set the pixel in I′ accordingly

MRF example: a world map

Using the MRF image denoising algorithm with η = 2.1, ζ = 1.5 on a map of the world corrupted by 10% uniformly distributed random noise (left) gives the image on the right, which has about 4% error, although it has smoothed out the edges of the continents.
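The denoising loop above can be sketched as a greedy, pixel-by-pixel energy minimisation. The function name, list-of-lists image representation, and the number of sweeps are illustrative choices, not part of the original slides:

```python
def denoise(noisy, eta=2.1, zeta=1.5, sweeps=5):
    """MRF image denoising: for each pixel of the reconstruction, choose the
    value in {-1, +1} with the lower local energy. `noisy` is a list of lists
    with entries in {-1, +1}. Names and defaults are illustrative."""
    h, w = len(noisy), len(noisy[0])
    ideal = [row[:] for row in noisy]          # initialise I' = I
    for _ in range(sweeps):
        for i in range(h):
            for j in range(w):
                def energy(v):
                    # Data term: -eta * I'_{i,j} * I_{i,j}.
                    e = -eta * v * noisy[i][j]
                    # Neighbour term: -zeta * sum of products with the
                    # current reconstruction's 4-neighbourhood.
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < h and 0 <= nj < w:
                            e -= zeta * v * ideal[ni][nj]
                    return e
                # Pick whichever value gives the lower local energy.
                ideal[i][j] = -1 if energy(-1) < energy(1) else 1
    return ideal
```

For example, an all-white 8×8 image with two pixels flipped to −1 is restored in a single sweep, because each flipped pixel disagrees with all four of its neighbours and the neighbour term (ζ) outweighs the data term (η).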