Graphical Models

Unit 11

Machine Learning, University of Vienna

• Bayesian Networks (directed graph)
• The Variable Elimination Algorithm
• Approximate Inference (The Gibbs Sampler)
• Markov networks (undirected graph)
• Markov Random Fields (MRF)
• Hidden Markov models (HMMs)
• The Viterbi Algorithm
• Kalman Filter

Simple graphical model 1

The graphs are sets of nodes, together with the links between them, which can be either directed or not. If two nodes are not linked, then those two variables are independent.

The arrows denote causal relationships between nodes that represent features.

The probability of A and B is the same as the probability of A times the probability of B conditioned on A: P(a, b) = P(b|a)P(a)

Simple graphical model 2

The nodes are separated into:

• observed nodes: where we can see their values directly
• hidden or latent nodes: whose values we hope to infer, and which may not have clear meanings in all cases.

C is conditionally independent of B, given A.

Example: Exam Panic

Directed acyclic graphs (DAG) paired with the conditional probability tables are called Bayesian networks.

• B - denotes a node stating whether the exam was boring
• R - whether or not you revised
• A - whether or not you attended lectures
• P - whether or not you will panic before the exam

P(b)   P(¬b)
0.5    0.5

B   P(r)   P(¬r)
T   0.3    0.7
F   0.8    0.2

B   P(a)   P(¬a)
T   0.1    0.9
F   0.5    0.5

R   A   P(p)   P(¬p)
T   T   0      1
T   F   0.8    0.2
F   T   0.6    0.4
F   F   1      0

The probability of panicking:

$$P(p) = \sum_{b,r,a} P(b, r, a, p) = \sum_{b,r,a} P(b) \times P(r \mid b) \times P(a \mid b) \times P(p \mid r, a)$$

Suppose you know that the course was boring, and want to work out how likely it is that you will panic before the exam.

P(p|b) = 0.3×0.1×0 + 0.7×0.1×0.6 + 0.3×0.9×0.8 + 0.7×0.9×1 = 0.888

Suppose you know that the course was not boring, and want to work out how likely it is that you will panic before the exam.

P(p|¬b) = 0.8×0.5×0 + 0.8×0.5×0.8 + 0.2×0.5×0.6 + 0.2×0.5×1 = 0.48

P(p) = P(p|b)P(b) + P(p|¬b)P(¬b) = 0.5 × 0.888 + 0.5 × 0.48 = 0.684

Backward inference or diagnosis

Suppose you panic outside the exam. Why are you panicking: was it because you didn't come to the lectures, or because you didn't revise?

Bayes' rule:

$$P(r \mid p) = \frac{P(p \mid r)\,P(r)}{P(p)} = \frac{\sum_{b,a} P(b, a, r, p)}{P(p)}$$

$$= \frac{0.5 \times 0.3\,(0.1 \times 0 + 0.9 \times 0.8) + 0.5 \times 0.8\,(0.5 \times 0 + 0.5 \times 0.8)}{P(p)} = \frac{0.268}{0.684} = 0.3918$$

Bayes' rule is the reason why this type of graphical model is known as a Bayesian network.

Computational costs

For a graph with N nodes, where each node can be either true or false, the computational cost is O(2^N).
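As a minimal sketch of what this exponential cost looks like, the exam panic network can be queried by brute-force enumeration of all 2^4 joint assignments. The names `joint` and `prob` are illustrative assumptions; the numbers are those of the conditional probability tables of the example.

```python
from itertools import product

P_B = 0.5                               # P(b)
P_R = {True: 0.3, False: 0.8}           # P(r | B)
P_A = {True: 0.1, False: 0.5}           # P(a | B)
P_P = {(True, True): 0.0, (True, False): 0.8,
       (False, True): 0.6, (False, False): 1.0}   # P(p | R, A)


def bern(p_true, value):
    """Probability of a binary value given the probability of 'true'."""
    return p_true if value else 1.0 - p_true


def joint(b, r, a, p):
    """P(b, r, a, p) factorised along the graph B -> R, B -> A, (R, A) -> P."""
    return (bern(P_B, b) * bern(P_R[b], r)
            * bern(P_A[b], a) * bern(P_P[(r, a)], p))


def prob(query, evidence=None):
    """P(query | evidence) by summing the joint over every assignment."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for b, r, a, p in product([True, False], repeat=4):   # 2^N terms
        world = {"b": b, "r": r, "a": a, "p": p}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        weight = joint(b, r, a, p)
        denominator += weight
        if all(world[k] == v for k, v in query.items()):
            numerator += weight
    return numerator / denominator


p_panic = prob({"p": True})
p_panic_given_boring = prob({"p": True}, {"b": True})
p_revised_given_panic = prob({"r": True}, {"p": True})
print(round(p_panic, 4), round(p_panic_given_boring, 4),
      round(p_revised_given_panic, 4))
```

The three printed values can be compared with P(p), P(p|b) and P(r|p) from the worked example.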

The problem of exact inference on Bayesian networks is NP-hard.

For polytrees where there is at most one path between any two nodes, the computational cost is linear in the size of the network.

Unfortunately, it is rare to find such polytrees in real examples, so we will consider approximate inference.

Variable Elimination Algorithm

• With the variable elimination algorithm one can speed things up a little by minimising the work done inside the program loops.
• The conditional probability tables are converted into λ tables, which simply list all of the possible values for all variables and which initially contain the conditional probabilities:

R   A   P   λ
T   T   T   0
T   T   F   1
T   F   T   0.8
T   F   F   0.2
F   T   T   0.6
F   T   F   0.4
F   F   T   1
F   F   F   0

To eliminate R from the graph we do the following calculation:

B   R   λ          R   A   λ
T   T   0.3        T   T   0
T   F   0.7        T   F   0.8
F   T   0.8        F   T   0.6
F   F   0.2        F   F   1

⇒

B   A   λ
T   T   0.3 × 0 + 0.7 × 0.6 = 0.42
T   F   0.3 × 0.8 + 0.7 × 1 = 0.94
F   T   0.8 × 0 + 0.2 × 0.6 = 0.12
F   F   0.8 × 0.8 + 0.2 × 1 = 0.84

Variable Elimination Algorithm I
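The elimination of R can be sketched in a few lines. This is only an illustration: the dictionary layout of the λ tables and the helper name `sum_out_shared` are assumptions, and the final step (eliminating A with P(a|B)) is added here as a check against P(p|b) = 0.888 and P(p|¬b) = 0.48.

```python
from itertools import product

VALUES = (True, False)

# λ table for P(R | B), keyed by (b, r).
lam_BR = {(True, True): 0.3, (True, False): 0.7,
          (False, True): 0.8, (False, False): 0.2}
# λ table for P(p | R, A) with the evidence P = true applied, keyed by (r, a).
lam_RA = {(True, True): 0.0, (True, False): 0.8,
          (False, True): 0.6, (False, False): 1.0}


def sum_out_shared(left, right):
    """Return out[(x, z)] = sum over y of left[(x, y)] * right[(y, z)]."""
    return {(x, z): sum(left[(x, y)] * right[(y, z)] for y in VALUES)
            for x, z in product(VALUES, repeat=2)}


# Eliminating R leaves a table over (B, A).
lam_BA = sum_out_shared(lam_BR, lam_RA)

# Assumption for the check: A is eliminated the same way using P(a | B).
p_a_given_b = {True: 0.1, False: 0.5}
p_panic_given = {
    b: p_a_given_b[b] * lam_BA[(b, True)]
       + (1.0 - p_a_given_b[b]) * lam_BA[(b, False)]
    for b in VALUES
}

for key, value in lam_BA.items():
    print(key, round(value, 2))
```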

create the λ tables:

- for each variable v:
  * make a new table
  * for all possible true assignments x of the parent variables:
    - add rows for P(v|x) and 1 − P(v|x) to the table
  * add this table to the set of tables

eliminate known variable v:

- for each table
  * remove rows where v is incorrect
  * remove column for v from table

Variable Elimination Algorithm II

eliminate other variable (where x is the variable to keep):

- for each variable v to be eliminated:
  * create a new table t′
  * for each table t containing v:
    - vtrue,t = vtrue,t × P(v|x)
    - vfalse,t = vfalse,t × P(¬v|x)
  * vtrue,t′ = Σ_t (vtrue,t)
  * vfalse,t′ = Σ_t (vfalse,t)
- replace tables t with the new t′

calculate conditional probability:

- for each table:

* xtrue = xtrue × P(x)

* xfalse = xfalse × P(¬x)

* probability is xtrue / (xtrue + xfalse)

The Monte Carlo methods (MCMC)

sample from the hidden variables

- start at the top of the graph
- sample from each of the known probability distributions

weight the samples by their likelihoods

In our example:

- generate a sample from P(b)
- use that value in the conditional probability tables for 'R' and 'A' to compute P(r|b = sample value) and P(a|b = sample value)
- use these three values to sample from P(p|b, a, r)
- take as many samples as you like in this way

Gibbs sampling
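The top-down sampling procedure for the exam panic example can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the sample count and the seed are assumptions, and the estimate is only expected to approach P(p) = 0.684 as the number of samples grows.

```python
import random

# P(p | R, A) from the conditional probability table of the example.
P_P = {(True, True): 0.0, (True, False): 0.8,
       (False, True): 0.6, (False, False): 1.0}


def sample_once(rng):
    """Draw one joint sample (b, r, a, p), parents before children."""
    b = rng.random() < 0.5                    # sample from P(b)
    r = rng.random() < (0.3 if b else 0.8)    # sample from P(r | b)
    a = rng.random() < (0.1 if b else 0.5)    # sample from P(a | b)
    p = rng.random() < P_P[(r, a)]            # sample from P(p | r, a)
    return b, r, a, p


def estimate_panic(n_samples=100000, seed=0):
    """Fraction of samples in which the student panics."""
    rng = random.Random(seed)
    hits = sum(sample_once(rng)[3] for _ in range(n_samples))
    return hits / n_samples


print(estimate_panic())
```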

In MCMC we have to work through the graph from top to bottom and select rows from the conditional probability tables that match the previous case. It is better to sample from the unconditional distribution and reject any samples that don't have the correct prior probability (rejection sampling). We can work out what evidence we already have and use this variable to assign likelihoods to the other variables that are sampled.

- set values for all of the possible probabilities, based on either evidence or random choices.
- find the probability distribution with Gibbs sampling

The probabilities in the network are:

$$p(x) = \prod_j p(x_j \mid x_{\alpha_j}),$$

where $x_{\alpha_j}$ are the parent nodes of $x_j$. In a Bayesian network, any given variable is independent of any node that is not its child, given its parents:

$$p(x_j \mid x_{-j}) \propto p(x_j \mid x_{\alpha_j}) \prod_{k \in \beta(j)} p(x_k \mid x_{\alpha_k}),$$

where $\beta(j)$ is the set of children of node $x_j$ and $x_{-j}$ signifies all values of $x_i$ except $x_j$.

For any node we only need to consider its parents, its children, and the other parents of the children. This is known as the Markov blanket of the node.

The Gibbs Sampler
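A minimal sketch of a Gibbs sampler for the exam panic network, with the evidence that you are panicking (p = true). For brevity each conditional is computed from the full joint rather than only from the Markov blanket factors; the remaining factors cancel in the ratio, so the result is the same. The function names, sweep counts and seed are assumptions, and the starting state must have non-zero probability.

```python
import random

P_B = 0.5
P_R = {True: 0.3, False: 0.8}           # P(r | B)
P_A = {True: 0.1, False: 0.5}           # P(a | B)
P_P = {(True, True): 0.0, (True, False): 0.8,
       (False, True): 0.6, (False, False): 1.0}   # P(p | R, A)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(s):
    """Joint probability of a full assignment s = {b, r, a, p}."""
    return (bern(P_B, s["b"]) * bern(P_R[s["b"]], s["r"])
            * bern(P_A[s["b"]], s["a"]) * bern(P_P[(s["r"], s["a"])], s["p"]))


def gibbs(evidence, n_sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(variable = true | evidence) for every hidden variable."""
    rng = random.Random(seed)
    # Assumed starting state; it has non-zero probability when p = true.
    state = {"b": True, "r": False, "a": False, "p": True}
    state.update(evidence)
    hidden = [name for name in state if name not in evidence]
    counts = dict.fromkeys(hidden, 0)
    for sweep in range(burn_in + n_sweeps):
        for name in hidden:
            # Resample one variable given the current values of all others.
            w_true = joint({**state, name: True})
            w_false = joint({**state, name: False})
            state[name] = rng.random() < w_true / (w_true + w_false)
        if sweep >= burn_in:
            for name in hidden:
                counts[name] += state[name]
    return {name: counts[name] / n_sweeps for name in hidden}


posterior = gibbs({"p": True})
print(posterior)
```

The estimate for `r` can be compared with the exact value P(r|p) = 0.3918 from the diagnosis example.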

for each variable $x_j$:

- initialise $x_j^{(0)}$

repeat

- for each variable $x_j$:
  * sample $x_1^{(i+1)}$ from $p(x_1 \mid x_2^{(i)}, \cdots, x_n^{(i)})$
  * sample $x_2^{(i+1)}$ from $p(x_2 \mid x_1^{(i+1)}, \cdots, x_n^{(i)})$
  * ...
  * sample $x_n^{(i+1)}$ from $p(x_n \mid x_1^{(i+1)}, \cdots, x_{n-1}^{(i+1)})$

until you have enough samples

Markov Random Fields (MRF): image denoising

Markov property: the state of a particular node is a function only of the states of its immediate neighbours.

Binary image $I$ with pixel values $I_{x_i,x_j} \in \{-1, 1\}$ has noise. We want to recover an "ideal" image $I'$ that has no noise in it. If the noise is small, then there should be a good correlation between $I_{x_i,x_j}$ and $I'_{x_i,x_j}$. Assume also that within a small "patch" or region in an image, there is a good correlation between pixels: $I_{x_i,x_j}$ should correlate well with $I_{x_i+1,x_j}$, $I_{x_i,x_j-1}$ etc.

Ising model

The original theory of MRFs was worked out by physicists in the Ising model:

a statistical description of a set of atoms connected in a chain, where each can spin up (+1) or down (−1) and whose spin affects those connected to it in the chain.

Physicists tend to think of the energy of such systems. Stable states are those with the lowest energy, since the system needs to get extra energy if it wants to move out of this state.

Markov Random Fields (MRF): image denoising

The energy of our pair of images must be low when the pixels match. The energy of the same pixel in two images is $-\eta I_{x_i,x_j} I'_{x_i,x_j}$, where $\eta$ is a positive constant.

The energy of two neighbouring pixels is $-\zeta I_{x_i,x_j} I_{x_i+1,x_j}$. The total energy:

$$E(I, I') = -\zeta \sum_{i,j}^{N} I_{x_i,x_j} I_{x_i \pm 1, x_j \pm 1} - \eta \sum_{i,j}^{N} I_{x_i,x_j} I'_{x_i,x_j},$$

where the index of the pixels is assumed to run from 1 to N in both the x and y directions.

The Image Denoising Algorithm

given a noisy image I and the original image I′, together with parameters η, ζ:

loop over the pixels of image I:

- compute the energies with the current pixel being −1 and 1

- pick the one with lower energy and set its value in I accordingly

MRF example: a world map
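The denoising loop can be tried on a small synthetic image. The sketch below makes several assumptions that are not in the slides: a 4-pixel neighbourhood, a 20 × 20 test image split into a −1 half and a +1 half, a fixed random seed, and the names `denoise`, `observed` (the fixed noisy input) and `current` (the image being updated).

```python
import random


def denoise(observed, eta=2.1, zeta=1.5, sweeps=1):
    """Set each pixel to the value (-1 or 1) with the lower local energy."""
    rows, cols = len(observed), len(observed[0])
    current = [row[:] for row in observed]
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                # Sum over the up/down/left/right neighbours inside the image.
                neighbours = sum(
                    current[x][y]
                    for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= x < rows and 0 <= y < cols
                )
                # Local energy with the current pixel being +1 and -1.
                e_plus = -eta * observed[i][j] - zeta * neighbours
                e_minus = eta * observed[i][j] + zeta * neighbours
                current[i][j] = 1 if e_plus < e_minus else -1
    return current


def count_errors(image, truth):
    return sum(p != q for r1, r2 in zip(image, truth) for p, q in zip(r1, r2))


rng = random.Random(0)
size = 20
truth = [[-1 if j < size // 2 else 1 for j in range(size)] for _ in range(size)]
# Flip each pixel independently with probability 0.1.
noisy = [[-p if rng.random() < 0.1 else p for p in row] for row in truth]
restored = denoise(noisy)
errors_before = count_errors(noisy, truth)
errors_after = count_errors(restored, truth)
print(errors_before, errors_after)
```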

Using the MRF image denoising algorithm with η = 2.1, ζ = 1.5 on a map of the world corrupted by 10% uniformly distributed random noise (left) gives the image on the right, which has about 4% error, although it has smoothed out the edges of all continents.

Hidden Markov Models (HMMs)

The Hidden Markov Model is one of the most popular graphical models. It is used in speech processing and in a lot of statistical work.

The HMM generally works on a set of temporal data.

At each clock tick the system moves into a new state, which can be the same as the previous one.

You see observations that do not uniquely identify the state. This is where the hidden in the title comes from.

The HMM is the simplest dynamic Bayesian network.

It is generally assumed that the Markov chain is ergodic: there is a non-zero probability of reaching every state eventually, no matter what the starting state.

There are four things that you can do in the evening (the hidden states):

go to the pub, watch TV, go to a party, study

I can only observe whether you look tired, hungover, scared or fine.

I don't know why you look the way you do, but I can guess by assigning probabilities to those things.

The HMM itself is made up of the transition probabilities $a_{ij}$ and the observation probabilities $b_{jk}$, with $\sum_j a_{ij} = 1$ and $\sum_k b_{jk} = 1$.

Previous night   TV     Pub    Party   Study
TV               0.4    0.3    0.1     0.2
Pub              0.6    0.05   0.1     0.25
Party            0.7    0.05   0.05    0.2
Study            0.3    0.4    0.25    0.05

         Tired   Hungover   Scared   Fine
TV       0.2     0.1        0.2      0.5
Pub      0.4     0.2        0.1      0.3
Party    0.3     0.4        0.2      0.1
Study    0.3     0.05       0.3      0.35

After a couple of weeks of observations there are three things that I want to do with the data:

• see how well the sequence of observations that I’ve made match my current HMM

• work out the most probable sequence of states that you’ve been in based on my observation

• given several sets of observations (for example, by watching several students), generate a good HMM for the data.

The Forward Algorithm

Suppose I see the following observations:

O = (tired, tired, fine, hungover, hungover, scared, hungover, fine)

The probability that my observations O = {o(1), ··· , o(T )} come from the model can be computed using simple conditional probability.

$$P(O) = \sum_{r=1}^{R} P(O \mid \Omega_r) P(\Omega_r)$$

The r index describes a possible sequence of states, so Ω1 is one sequence, Ω2 another, and so on.

We use the Markov property

$$P(\Omega_r) = \prod_{t=1}^{T} P(\omega_j(t) \mid \omega_i(t-1)) = \prod_{t=1}^{T} a_{ij}$$

and

$$P(O \mid \Omega_r) = \prod_{t=1}^{T} P(o_k(t) \mid \omega_j(t)) = \prod_{t=1}^{T} b_{jk}$$

$$P(O) = \sum_{r=1}^{R} \prod_{t=1}^{T} P(o_k(t) \mid \omega_j(t)) P(\omega_j(t) \mid \omega_i(t-1)) = \sum_{r=1}^{R} \prod_{t=1}^{T} b_{jk} a_{ij}$$

The Forward Trellis

A new variable $\alpha_i(t)$ describes the probability that at time t the state is $\omega_i$ and the first (t − 1) steps all matched the observations o(t):

$$\alpha_j(t) = \begin{cases} 0 & t = 0,\ j \neq \text{initial state} \\ 1 & t = 0,\ j = \text{initial state} \\ \sum_i \alpha_i(t-1)\, a_{ij}\, b_j(o_t) & \text{otherwise} \end{cases}$$

αTV(0) = 0.25, αPub(0) = 0.25, αParty(0) = 0.25, αStudy(0) = 0.25

Here aij is read from the transition table with row i as the previous night's state, so that each row sums to 1.

αTV(1) = (αTV(0) aTV,TV + αPub(0) aPub,TV + αParty(0) aParty,TV + αStudy(0) aStudy,TV) bTV,Tired = (0.25 × 0.4 + 0.25 × 0.6 + 0.25 × 0.7 + 0.25 × 0.3) × 0.2 = 0.1

αPub(1) = (αTV(0) aTV,Pub + αPub(0) aPub,Pub + αParty(0) aParty,Pub + αStudy(0) aStudy,Pub) bPub,Tired = (0.25 × 0.3 + 0.25 × 0.05 + 0.25 × 0.05 + 0.25 × 0.4) × 0.4 = 0.08

αParty(1) = (αTV(0) aTV,Party + αPub(0) aPub,Party + αParty(0) aParty,Party + αStudy(0) aStudy,Party) bParty,Tired = (0.25 × 0.1 + 0.25 × 0.1 + 0.25 × 0.05 + 0.25 × 0.25) × 0.3 = 0.0375

αStudy(1) = (αTV(0) aTV,Study + αPub(0) aPub,Study + αParty(0) aParty,Study + αStudy(0) aStudy,Study) bStudy,Tired = (0.25 × 0.2 + 0.25 × 0.25 + 0.25 × 0.2 + 0.25 × 0.05) × 0.3 = 0.0525

The HMM Forward Algorithm
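A minimal sketch of the forward recursion for the evening-activities HMM. It assumes that the rows of the transition table index the previous night's state (so that each row sums to 1) and that the initial distribution is uniform; the names `forward`, `A`, `B` and `OBS_NAMES` are illustrative.

```python
STATES = ["TV", "Pub", "Party", "Study"]
OBS_NAMES = ["Tired", "Hungover", "Scared", "Fine"]

# A[i][j] = P(state j tonight | state i the previous night); rows sum to 1.
A = [[0.4, 0.3, 0.1, 0.2],
     [0.6, 0.05, 0.1, 0.25],
     [0.7, 0.05, 0.05, 0.2],
     [0.3, 0.4, 0.25, 0.05]]
# B[j][k] = P(observation k | state j).
B = [[0.2, 0.1, 0.2, 0.5],
     [0.4, 0.2, 0.1, 0.3],
     [0.3, 0.4, 0.2, 0.1],
     [0.3, 0.05, 0.3, 0.35]]


def forward(obs_seq, init):
    """Return the trellis alpha[t][j] for t = 0..T, with alpha[0] = init."""
    n = len(A)
    alpha = [list(init)]
    for name in obs_seq:
        k = OBS_NAMES.index(name)
        prev = alpha[-1]
        alpha.append([B[j][k] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


observations = ["Tired", "Tired", "Fine", "Hungover",
                "Hungover", "Scared", "Hungover", "Fine"]
alpha = forward(observations, [0.25] * 4)
likelihood = sum(alpha[-1])   # P(O) under the current model
print(dict(zip(STATES, alpha[1])), likelihood)
```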

For each observation in order $o_t$, t = 1, ···, T

- for each possible state s

$\alpha_s(t) = b_s(o_t) \times \sum_x (\alpha_{x,t-1} \times a_{x,s})$

The Viterbi Algorithm

For each timestep we pick the state that is most likely as the next step in the path, rather than maintaining probabilities of all possible paths.

For each observation in order $o_t$, t = 1, ···, T

- for each possible state s

$v_s(t) = \max_x (v_{x,t-1} \times a_{x,s} \times b_s(o_t))$

- path(t) = $\arg\max_x (v_x(t))$
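The recursion above can be sketched directly. The code follows the steps as written here (a max over predecessors, then the best state at each timestep); the usual formulation of Viterbi also stores back-pointers and backtracks from the final state, which this sketch omits. The uniform start and the row-as-previous-night reading of the transition table are assumptions.

```python
STATES = ["TV", "Pub", "Party", "Study"]
OBS_NAMES = ["Tired", "Hungover", "Scared", "Fine"]

# A[x][s] = P(state s tonight | state x the previous night).
A = [[0.4, 0.3, 0.1, 0.2],
     [0.6, 0.05, 0.1, 0.25],
     [0.7, 0.05, 0.05, 0.2],
     [0.3, 0.4, 0.25, 0.05]]
# B[s][k] = P(observation k | state s).
B = [[0.2, 0.1, 0.2, 0.5],
     [0.4, 0.2, 0.1, 0.3],
     [0.3, 0.4, 0.2, 0.1],
     [0.3, 0.05, 0.3, 0.35]]


def viterbi_path(obs_seq, init):
    """Return path(t) = argmax_s v_s(t) for every timestep t = 1..T."""
    n = len(A)
    v = list(init)
    path = []
    for name in obs_seq:
        k = OBS_NAMES.index(name)
        # v_s(t) = max_x v_x(t-1) * a_{x,s} * b_s(o_t)
        v = [max(v[x] * A[x][s] for x in range(n)) * B[s][k]
             for s in range(n)]
        path.append(STATES[max(range(n), key=lambda s: v[s])])
    return path


observations = ["Tired", "Tired", "Fine", "Hungover",
                "Hungover", "Scared", "Hungover", "Fine"]
path = viterbi_path(observations, [0.25] * 4)
print(path)
```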

So path(1) = "Pub".

The Baum-Welch or Forward-Backward Algorithm

The unsupervised learning problem is to generate the HMM from sets of observations. We complement the forward algorithm with a variable β that takes us backwards through the HMM, i.e. $\beta_i(t)$ tells us the probability that at time t we are in state $\omega_i$ and the rest of the target sequence (times t + 1 to T) will be generated correctly:

$$\beta_i(t) = \begin{cases} 0 & t = T,\ i \neq \text{final state} \\ 1 & t = T,\ i = \text{final state} \\ \sum_j \beta_j(t+1)\, a_{ij}\, b_j(o_{t+1}) & \text{otherwise} \end{cases}$$

We can run backwards through the HMM from the known end point.

The Backward Trellis

βTV(8) = 0.25, βPub(8) = 0.25, βParty(8) = 0.25, βStudy(8) = 0.25

βTV(7) = βTV(8) aTV,TV bTV,fine + βPub(8) aTV,Pub bPub,fine + βParty(8) aTV,Party bParty,fine + βStudy(8) aTV,Study bStudy,fine = 0.25 × 0.4 × 0.5 + 0.25 × 0.3 × 0.3 + 0.25 × 0.1 × 0.1 + 0.25 × 0.2 × 0.35 = 0.0925

βPub(7) = βTV(8) aPub,TV bTV,fine + βPub(8) aPub,Pub bPub,fine + βParty(8) aPub,Party bParty,fine + βStudy(8) aPub,Study bStudy,fine = 0.25 × 0.6 × 0.5 + 0.25 × 0.05 × 0.3 + 0.25 × 0.1 × 0.1 + 0.25 × 0.25 × 0.35 = 0.103125

βParty(7) = βTV(8) aParty,TV bTV,fine + βPub(8) aParty,Pub bPub,fine + βParty(8) aParty,Party bParty,fine + βStudy(8) aParty,Study bStudy,fine = 0.25 × 0.7 × 0.5 + 0.25 × 0.05 × 0.3 + 0.25 × 0.05 × 0.1 + 0.25 × 0.2 × 0.35 = 0.11

βStudy(7) = βTV(8) aStudy,TV bTV,fine + βPub(8) aStudy,Pub bPub,fine + βParty(8) aStudy,Party bParty,fine + βStudy(8) aStudy,Study bStudy,fine = 0.25 × 0.3 × 0.5 + 0.25 × 0.4 × 0.3 + 0.25 × 0.25 × 0.1 + 0.25 × 0.05 × 0.35 = 0.078125

The Baum-Welch or Forward-Backward Algorithm

We can use these forwards and backwards estimates to compute transition probabilities. Suppose we want to compute the probability of a transition between state ωi at time t and ωj at time t + 1. We run our current model forwards via α to get to state ωi at time t and run backwards to get to state ωj at time t + 1 via β.

Then we use the current estimates of aij and bjk. We normalise this calculation by how likely this particular training sequence is according to the current model, which is P(O|aij, bjk).

This value is usually called γij:

$$\gamma_{ij}(t) = \frac{\alpha_i(t-1)\, a_{ij}\, b_{jk}\, \beta_j(t)}{P(O \mid a_{ij}, b_{jk})}$$

The update rule for transition probabilities

$\sum_{t=1}^{T} \gamma_{ij}(t)$ tells us how many times we can expect to transition from state ωi to state ωj at any time in the sequence.

We need to divide this number by the number of times we expect to transition out of state ωi, regardless of where we end up:

$$\sum_{t=1}^{T} \sum_m \gamma_{im}(t)$$

The update rule for aij:

$$a_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_m \gamma_{im}(t)}$$

The update rule for observation probabilities

We need to think about the frequency that an observation ok is made in state j compared to any other symbol:

$$b_{jk} = \frac{\sum_{t=1,\, o(t)=o_k}^{T} \sum_m \gamma_{jm}(t)}{\sum_{t=1}^{T} \sum_m \gamma_{jm}(t)}$$

The HMM Baum-Welch Algorithm
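A minimal sketch of one re-estimation of the transition probabilities, combining the forward and backward passes with the update rule for aij. The assumptions are: a uniform start and a uniform end vector of 0.25 (as in the backward trellis), observations encoded as column indices of the observation table, and normalisation of each time slice by its own sum, which equals P(O) up to a constant factor. All function names are illustrative.

```python
# A[i][j] = P(state j | previous state i), B[j][k] = P(observation k | state j).
A = [[0.4, 0.3, 0.1, 0.2],
     [0.6, 0.05, 0.1, 0.25],
     [0.7, 0.05, 0.05, 0.2],
     [0.3, 0.4, 0.25, 0.05]]
B = [[0.2, 0.1, 0.2, 0.5],
     [0.4, 0.2, 0.1, 0.3],
     [0.3, 0.4, 0.2, 0.1],
     [0.3, 0.05, 0.3, 0.35]]
# tired, tired, fine, hungover, hungover, scared, hungover, fine
OBS = [0, 0, 3, 1, 1, 2, 1, 3]


def forward(A, B, obs, init):
    n = len(A)
    alpha = [list(init)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([B[j][o] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(A, B, obs, final):
    n = len(A)
    beta = [list(final)]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(nxt[j] * A[i][j] * B[j][o] for j in range(n))
                        for i in range(n)])
    return beta


def reestimate_transitions(A, B, obs, init, final):
    """One update of a_ij from the expected transition counts."""
    n = len(A)
    alpha = forward(A, B, obs, init)
    beta = backward(A, B, obs, final)
    expected = [[0.0] * n for _ in range(n)]
    for t, o in enumerate(obs):
        # Unnormalised gamma_ij for the step from index t to t + 1.
        gamma = [[alpha[t][i] * A[i][j] * B[j][o] * beta[t + 1][j]
                  for j in range(n)] for i in range(n)]
        norm = sum(map(sum, gamma))
        for i in range(n):
            for j in range(n):
                expected[i][j] += gamma[i][j] / norm
    return [[expected[i][j] / sum(expected[i]) for j in range(n)]
            for i in range(n)]


beta = backward(A, B, OBS, [0.25] * 4)
new_A = reestimate_transitions(A, B, OBS, [0.25] * 4, [0.25] * 4)
print(beta[7])
print(new_A)
```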

while updates have not converged:

- E-step:
  - Compute forwards and backwards steps (α and β)
  - for each observation in order $o_t$, t = 1 ··· T
    * for each possible pair of states s and σ:
      $\gamma_{\sigma,s,t} = \alpha_{\sigma,t} \times a_{\sigma,s} \times \beta_{s,t+1} \times b_{s,o(t+1)} / \sum_x \alpha_{x,T}$
- M-step:
  - for each possible pair of states s and σ:
    * $a_{s,\sigma} = \sum_t \gamma_{s,\sigma,t} / \sum_y \sum_t \gamma_{s,y,t}$
  - for each observation o:
    * for each state s: tally = $\sum_t \sum_y \gamma_{s,y,t}$
      $b_{s,o}$ = sum(tally where observation o was seen) / total tally

Tracking Methods

The Kalman Filter

The state, which is hidden, consists of the variables that we want to know, which we see through noisy observations over time.

The filter:

• makes an estimate of the next step,
• computes an error term based on the value that was actually produced in the next step,
• tries to correct it, then uses both of those to make the next prediction, and
• iterates this procedure.

The process is linear and all of the distributions are Gaussian with constant covariances Q and R (X ∼ N(0, 1)), so the following holds.

The transition model (A):

$$P(x_{t+1} \mid x_t) = N(x_{t+1} \mid A x_t, Q)$$

The observation model (H):

$$P(z_{t+1} \mid x_{t+1}) = N(z_{t+1} \mid H x_{t+1}, R)$$

Predicted observation:

$$\hat{z}_{t+1} = H A x_t$$

The error: $z_{t+1} - H A x_t$

$\Sigma_t$ is the covariance matrix of $x_t$:

$\hat{\Sigma}_{t+1} = A \Sigma_t A^T + Q$ is the predicted covariance matrix of $x_{t+1}$

The Kalman gain

The Kalman filter weights these error computations by how much trust the filter currently has in its predictions:

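$$K_{t+1} = \hat{\Sigma}_{t+1} H^T (H \hat{\Sigma}_{t+1} H^T + R)^{-1}$$

The update for the estimate is

$$x_{t+1} = \hat{x}_{t+1} + K_{t+1}(z_{t+1} - H \hat{x}_{t+1})$$

The update of the covariance matrix:

$$\Sigma_{t+1} = (I - K_{t+1} H) \hat{\Sigma}_{t+1}$$

The Kalman Filter Algorithm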

given an initial estimate x(0)

for each timestep:

- predict the next step

  * predict state as $\hat{x}_{t+1} = A x_t$
  * predict covariance $\hat{\Sigma}_{t+1} = A \Sigma_t A^T + Q$

- update the estimate

  * compute the error in the estimate, $\epsilon = z_{t+1} - H A x_t$
  * compute the Kalman gain $K_{t+1} = \hat{\Sigma}_{t+1} H^T (H \hat{\Sigma}_{t+1} H^T + R)^{-1}$
  * update the state $x_{t+1} = \hat{x}_{t+1} + K_{t+1}(z_{t+1} - H \hat{x}_{t+1})$
  * update the covariance $\Sigma_{t+1} = (I - K_{t+1} H) \hat{\Sigma}_{t+1}$

Tracking problem

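y - position, y′ - velocity of the object.

$$x_t = (y, y')^T$$

The update equation:

$$x_{t+1} = A x_t + B a_{t+1},$$

where the acceleration $a_t$ is $N(0, \sigma)$ and

$$A = \begin{pmatrix} 1 & \Delta t \\ 0 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} \frac{1}{2}\Delta t^2 \\ \Delta t \end{pmatrix}$$

Example: Tracking problem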
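A minimal sketch of the Kalman filter on this tracking model, written out for the 2 × 2 case without any matrix library. Several choices here are assumptions rather than part of the slides: only the position is observed (H = (1, 0)), σ is read as the standard deviation of the acceleration, the process covariance is taken as Q = σ² B Bᵀ, and the simulation parameters and seed are arbitrary.

```python
import random


def kalman_step(x, P, z, dt, sigma_a, r):
    """One predict/update cycle for the state x = [position, velocity]."""
    # Predict: x_hat = A x and P_hat = A P A^T + Q, with A = [[1, dt], [0, 1]].
    x_hat = [x[0] + dt * x[1], x[1]]
    q = sigma_a ** 2
    p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q * dt ** 4 / 4
    p01 = P[0][1] + dt * P[1][1] + q * dt ** 3 / 2
    p10 = P[1][0] + dt * P[1][1] + q * dt ** 3 / 2
    p11 = P[1][1] + q * dt ** 2
    # Update: error, gain K = P_hat H^T / (H P_hat H^T + R), then correction.
    err = z - x_hat[0]
    s = p00 + r
    k0, k1 = p00 / s, p10 / s
    x_new = [x_hat[0] + k0 * err, x_hat[1] + k1 * err]
    # (I - K H) P_hat, written out element by element.
    P_new = [[(1 - k0) * p00, (1 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return x_new, P_new


rng = random.Random(1)
dt, sigma_a, meas_std, steps = 1.0, 0.1, 2.0, 200
truth = [0.0, 1.0]                        # true position and velocity
x, P = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
sq_meas = sq_filt = 0.0
for _ in range(steps):
    a = rng.gauss(0.0, sigma_a)           # random acceleration
    truth = [truth[0] + dt * truth[1] + 0.5 * dt * dt * a, truth[1] + dt * a]
    z = truth[0] + rng.gauss(0.0, meas_std)   # noisy position measurement
    x, P = kalman_step(x, P, z, dt, sigma_a, meas_std ** 2)
    sq_meas += (z - truth[0]) ** 2
    sq_filt += (x[0] - truth[0]) ** 2
rmse_meas = (sq_meas / steps) ** 0.5
rmse_filt = (sq_filt / steps) ** 0.5
print(rmse_meas, rmse_filt)
```

Comparing the two printed root-mean-square errors gives a rough idea of how much the filter improves on the raw measurements for this simulated track.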